1. Introduction

This is a comprehensive Exploratory Data Analysis for the Web Traffic Time Series Forecasting competition.

1.1 Load libraries

In [1]:
import pandas as pd
import numpy as np
from dfply import *
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import linear_model as lm
from plotnine import *
from datetime import datetime
from pmdarima.arima import auto_arima  # the pyramid-arima package was renamed to pmdarima
import math

1.2 Load data files

Note that key_1.csv is large, at about 700 MB, so for the purpose of this exploration we read only its first few rows.

In [2]:
train = pd.read_csv('Files/train_1.csv')
key = pd.read_csv('Files/key_1.csv', nrows=5)
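
As a minimal sketch of the partial read used for the key file, the snippet below applies `nrows` to a small in-memory CSV. The toy contents and the names `csv_text`, `head` and `total_rows` are assumptions made for illustration, not part of the competition data. It also shows `chunksize`, one possible alternative when the whole file is needed but does not fit comfortably in memory.

```python
import io

import pandas as pd

# Toy stand-in for a large CSV file; the contents are made up for illustration.
csv_text = "Page,Id\n" + "".join(f"page_{i},id_{i}\n" for i in range(10))

# nrows stops parsing after the requested number of data rows.
head = pd.read_csv(io.StringIO(csv_text), nrows=5)
print(head.shape)

# chunksize returns an iterator of DataFrames, so only one chunk is held at a time.
total_rows = sum(len(chunk) for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4))
print(total_rows)
```

With a real file, the `io.StringIO(...)` argument would simply be replaced by the file path.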

1.3 File structure and content

Dimensions and content of train and key

In [3]:
train
Out[3]:
Page 2015-07-01 2015-07-02 2015-07-03 2015-07-04 2015-07-05 2015-07-06 2015-07-07 2015-07-08 2015-07-09 ... 2016-12-22 2016-12-23 2016-12-24 2016-12-25 2016-12-26 2016-12-27 2016-12-28 2016-12-29 2016-12-30 2016-12-31
0 2NE1_zh.wikipedia.org_all-access_spider 18.0 11.0 5.0 13.0 14.0 9.0 9.0 22.0 26.0 ... 32.0 63.0 15.0 26.0 14.0 20.0 22.0 19.0 18.0 20.0
1 2PM_zh.wikipedia.org_all-access_spider 11.0 14.0 15.0 18.0 11.0 13.0 22.0 11.0 10.0 ... 17.0 42.0 28.0 15.0 9.0 30.0 52.0 45.0 26.0 20.0
2 3C_zh.wikipedia.org_all-access_spider 1.0 0.0 1.0 1.0 0.0 4.0 0.0 3.0 4.0 ... 3.0 1.0 1.0 7.0 4.0 4.0 6.0 3.0 4.0 17.0
3 4minute_zh.wikipedia.org_all-access_spider 35.0 13.0 10.0 94.0 4.0 26.0 14.0 9.0 11.0 ... 32.0 10.0 26.0 27.0 16.0 11.0 17.0 19.0 10.0 11.0
4 52_Hz_I_Love_You_zh.wikipedia.org_all-access_s... NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 48.0 9.0 25.0 13.0 3.0 11.0 27.0 13.0 36.0 10.0
5 5566_zh.wikipedia.org_all-access_spider 12.0 7.0 4.0 5.0 20.0 8.0 5.0 17.0 24.0 ... 16.0 27.0 8.0 17.0 32.0 19.0 23.0 17.0 17.0 50.0
6 91Days_zh.wikipedia.org_all-access_spider NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 2.0 7.0 33.0 8.0 11.0 4.0 15.0 6.0 8.0 6.0
7 A'N'D_zh.wikipedia.org_all-access_spider 118.0 26.0 30.0 24.0 29.0 127.0 53.0 37.0 20.0 ... 64.0 35.0 35.0 28.0 20.0 23.0 32.0 39.0 32.0 17.0
8 AKB48_zh.wikipedia.org_all-access_spider 5.0 23.0 14.0 12.0 9.0 9.0 35.0 15.0 14.0 ... 34.0 105.0 72.0 36.0 33.0 30.0 36.0 38.0 31.0 97.0
9 ASCII_zh.wikipedia.org_all-access_spider 6.0 3.0 5.0 12.0 6.0 5.0 4.0 13.0 9.0 ... 25.0 17.0 22.0 29.0 30.0 29.0 35.0 44.0 26.0 41.0
10 ASTRO_zh.wikipedia.org_all-access_spider NaN NaN NaN NaN NaN 1.0 1.0 NaN NaN ... 11.0 38.0 85.0 79.0 30.0 14.0 10.0 38.0 12.0 51.0
11 Ahq_e-Sports_Club_zh.wikipedia.org_all-access_... 2.0 1.0 4.0 4.0 2.0 6.0 3.0 6.0 9.0 ... 8.0 17.0 18.0 48.0 19.0 14.0 9.0 23.0 11.0 7.0
12 All_your_base_are_belong_to_us_zh.wikipedia.or... 2.0 5.0 5.0 1.0 3.0 3.0 5.0 3.0 17.0 ... 5.0 4.0 4.0 5.0 2.0 9.0 7.0 4.0 5.0 0.0
13 AlphaGo_zh.wikipedia.org_all-access_spider NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 14.0 13.0 14.0 17.0 19.0 56.0 21.0 13.0 21.0 11.0
14 Android_zh.wikipedia.org_all-access_spider 8.0 27.0 9.0 25.0 25.0 10.0 34.0 22.0 17.0 ... 36.0 36.0 46.0 42.0 40.0 40.0 66.0 43.0 38.0 74.0
15 Angelababy_zh.wikipedia.org_all-access_spider 40.0 17.0 25.0 42.0 41.0 7.0 18.0 21.0 33.0 ... 27.0 40.0 26.0 30.0 68.0 31.0 77.0 42.0 111.0 37.0
16 Apink_zh.wikipedia.org_all-access_spider 61.0 33.0 21.0 10.0 26.0 11.0 39.0 195.0 62.0 ... 14.0 24.0 35.0 34.0 24.0 34.0 28.0 44.0 12.0 31.0
17 Apple_II_zh.wikipedia.org_all-access_spider 4.0 8.0 4.0 9.0 7.0 4.0 15.0 9.0 17.0 ... 5.0 14.0 8.0 11.0 8.0 24.0 10.0 15.0 12.0 11.0
18 As_One_zh.wikipedia.org_all-access_spider 13.0 7.0 14.0 11.0 20.0 5.0 32.0 11.0 6.0 ... 37.0 12.0 7.0 11.0 13.0 17.0 13.0 12.0 9.0 8.0
19 B-PROJECT_zh.wikipedia.org_all-access_spider NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 4.0 26.0 10.0 5.0 5.0 11.0 10.0 4.0 8.0 6.0
20 B1A4_zh.wikipedia.org_all-access_spider 22.0 11.0 23.0 10.0 6.0 12.0 74.0 17.0 38.0 ... 43.0 23.0 52.0 60.0 14.0 19.0 38.0 30.0 21.0 24.0
21 BDSM_zh.wikipedia.org_all-access_spider 25.0 3.0 3.0 4.0 12.0 14.0 16.0 15.0 22.0 ... 12.0 18.0 23.0 17.0 20.0 19.0 20.0 38.0 21.0 16.0
22 BEAST_zh.wikipedia.org_all-access_spider 19.0 6.0 12.0 14.0 13.0 7.0 12.0 64.0 9.0 ... 11.0 13.0 20.0 30.0 16.0 24.0 47.0 26.0 13.0 13.0
23 BIGBANG_zh.wikipedia.org_all-access_spider 23.0 24.0 31.0 9.0 21.0 27.0 15.0 8.0 50.0 ... 85.0 63.0 80.0 29.0 37.0 40.0 104.0 39.0 32.0 34.0
24 BLACK_PINK_zh.wikipedia.org_all-access_spider NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 32.0 56.0 39.0 65.0 78.0 143.0 96.0 63.0 28.0 75.0
25 BLEACH_zh.wikipedia.org_all-access_spider 11.0 5.0 13.0 8.0 6.0 5.0 8.0 5.0 12.0 ... 16.0 13.0 14.0 15.0 14.0 21.0 15.0 16.0 15.0 28.0
26 BTOB_zh.wikipedia.org_all-access_spider 22.0 67.0 26.0 34.0 38.0 13.0 17.0 33.0 43.0 ... 19.0 17.0 14.0 17.0 28.0 23.0 32.0 37.0 42.0 60.0
27 Beautiful_Mind_zh.wikipedia.org_all-access_spider NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 11.0 8.0 6.0 7.0 2.0 11.0 11.0 29.0 12.0 14.0
28 Beyond_zh.wikipedia.org_all-access_spider 291.0 64.0 26.0 20.0 28.0 6.0 20.0 10.0 48.0 ... 23.0 35.0 53.0 35.0 20.0 29.0 40.0 28.0 39.0 75.0
29 Big_zh.wikipedia.org_all-access_spider 3.0 53.0 11.0 3.0 4.0 3.0 11.0 9.0 5.0 ... 11.0 20.0 9.0 13.0 7.0 17.0 13.0 20.0 19.0 13.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
145033 Sin_senos_sí_hay_paraíso_es.wikipedia.org_all-... NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 19.0 40.0 22.0 28.0 13.0 53.0 12.0 32.0 11.0 62.0
145034 Anexo:Medallero_de_los_Juegos_Olímpicos_de_Río... NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 3.0 9.0 1.0 9.0 5.0 5.0 5.0 8.0 5.0 2.0
145035 Arrival_(película)_es.wikipedia.org_all-access... NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 10.0 5.0 14.0 8.0 54.0 18.0 46.0 70.0 15.0 6.0
145036 Anexo:Baloncesto_en_los_Juegos_Olímpicos_de_Rí... NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 1.0 1.0 2.0 5.0 5.0 1.0 0.0 1.0 1.0 1.0
145037 Hasta_que_te_conocí_(serie_de_televisión)_es.w... NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 18.0 8.0 3.0 4.0 31.0 5.0 1.0 2.0 0.0 2.0
145038 Westworld_(serie_de_televisión)_es.wikipedia.o... NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 7.0 23.0 28.0 31.0 26.0 12.0 13.0 12.0 9.0 10.0
145039 Milénico_es.wikipedia.org_all-access_spider NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 14.0 46.0 51.0 11.0 11.0 14.0 26.0 13.0 12.0 7.0
145040 Moonlight_(película)_es.wikipedia.org_all-acce... NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 3.0 4.0 13.0 10.0 2.0 4.0 4.0 3.0 1.0 2.0
145041 Sully_(película)_es.wikipedia.org_all-access_s... NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 10.0 9.0 8.0 37.0 7.0 5.0 9.0 7.0 10.0 4.0
145042 Pulsaciones_(serie_de_televisión)_es.wikipedia... NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 1.0 2.0 23.0 11.0 4.0 25.0 2.0 14.0 2.0 14.0
145043 2091_(serie_de_televisión)_es.wikipedia.org_al... NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 4.0 7.0 3.0 4.0 4.0 2.0 4.0 5.0 2.0 2.0
145044 Campeonato_Sudamericano_de_Fútbol_Sub-20_de_20... NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 2.0 7.0 8.0 20.0 27.0 11.0 7.0 17.0 13.0 40.0
145045 Split_(película)_es.wikipedia.org_all-access_s... NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 2.0 1.0 2.0 0.0 0.0 1.0 1.0 1.0 0.0 1.0
145046 Huracán_Matthew_es.wikipedia.org_all-access_sp... NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 6.0 4.0 4.0 6.0 5.0 5.0 13.0 7.0 11.0 7.0
145047 Fences_(película)_es.wikipedia.org_all-access_... NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 5.0 3.0 10.0 1.0 6.0 22.0 34.0 1.0 3.0 29.0
145048 Logan_(película)_es.wikipedia.org_all-access_s... NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 26.0 25.0 7.0 5.0 8.0 25.0 2.0 8.0 3.0 1.0
145049 La_doña_(telenovela_de_2016)_es.wikipedia.org_... NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 6.0 35.0 10.0 3.0 4.0 1.0 31.0 27.0 9.0 135.0
145050 RTS_(canal_de_televisión)_es.wikipedia.org_all... NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 1.0 2.0 7.0 2.0 3.0 2.0 18.0 40.0 1.0 42.0
145051 La_ley_del_corazón_es.wikipedia.org_all-access... NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 22.0 74.0 222.0 2.0 16.0 21.0 7.0 34.0 37.0 42.0
145052 The_Crown_(serie_de_televisión)_es.wikipedia.o... NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 5.0 83.0 44.0 36.0 9.0 4.0 17.0 6.0 11.0 5.0
145053 Drake_(músico)_es.wikipedia.org_all-access_spider NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 13.0 7.0 6.0 3.0 3.0 8.0 21.0 14.0 24.0 37.0
145054 Skam_(serie_de_televisión)_es.wikipedia.org_al... NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 8.0 9.0 9.0 19.0 17.0 7.0 13.0 12.0 31.0 11.0
145055 Legión_(serie_de_televisión)_es.wikipedia.org_... NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 1.0 2.0 1.0 1.0 3.0 4.0 2.0 4.0 4.0 3.0
145056 Doble_tentación_es.wikipedia.org_all-access_sp... NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 21.0 NaN NaN NaN NaN NaN NaN NaN NaN 51.0
145057 Mi_adorable_maldición_es.wikipedia.org_all-acc... NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN
145058 Underworld_(serie_de_películas)_es.wikipedia.o... NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN 13.0 12.0 13.0 3.0 5.0 10.0
145059 Resident_Evil:_Capítulo_Final_es.wikipedia.org... NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
145060 Enamorándome_de_Ramón_es.wikipedia.org_all-acc... NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
145061 Hasta_el_último_hombre_es.wikipedia.org_all-ac... NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
145062 Francisco_el_matemático_(serie_de_televisión_d... NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

145063 rows × 551 columns
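
The preview shows many NaN entries, especially in the early dates of pages near the end of the table. A minimal sketch of how the missing values could be quantified is given below. It uses a toy frame in the same wide layout (one row per page, one column per day); the page names, the values and the variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame in the same wide layout as train; all values are hypothetical.
toy = pd.DataFrame({
    "Page": ["a_zh.wikipedia.org_all-access_spider",
             "b_es.wikipedia.org_all-access_spider"],
    "2015-07-01": [18.0, np.nan],
    "2015-07-02": [11.0, np.nan],
    "2015-07-03": [5.0, 3.0],
})

# Keep only the daily view counts.
views = toy.drop(columns="Page")

# Number of missing days for each page (row-wise count).
missing_per_page = views.isna().sum(axis=1)

# Overall share of missing cells across all pages and days.
missing_share = views.isna().to_numpy().mean()

print(missing_per_page.tolist())
print(round(missing_share, 3))
```

The same two expressions should work unchanged on `train`, assuming `Page` is its only non-date column.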

In [4]:
print(train.shape)
print(key.shape)
print(train.describe())
(145063, 551)
(5, 2)
[output of train.describe() not reproduced here]
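
Note that `describe` is a method. Printing `df.describe` without parentheses prints the bound method, whose representation contains the whole frame, whereas `df.describe()` computes per-column summary statistics. The sketch below illustrates the difference on a toy frame; the values and the names `toy`, `bound` and `summary` are assumptions made for illustration.

```python
import pandas as pd

# Toy frame with two daily columns; the values are made up.
toy = pd.DataFrame({
    "2015-07-01": [18.0, 11.0, 1.0],
    "2015-07-02": [11.0, 14.0, 0.0],
})

# shape is an attribute: (number of rows, number of columns).
print(toy.shape)

# Without parentheses we only get the method object; nothing is computed.
bound = toy.describe
print(callable(bound))

# Calling the method returns a DataFrame with count, mean, std, min, quartiles and max.
summary = toy.describe()
print(summary.loc["mean"])
```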
In [5]:
key
Out[5]:
Page Id
0 !vote_en.wikipedia.org_all-access_all-agents_2... bf4edcf969af
1 !vote_en.wikipedia.org_all-access_all-agents_2... 929ed2bf52b9
2 !vote_en.wikipedia.org_all-access_all-agents_2... ff29d0f51d5c
3 !vote_en.wikipedia.org_all-access_all-agents_2... e98873359be6
4 !vote_en.wikipedia.org_all-access_all-agents_2... fa012434263a
In [6]:
train["Page"]
Out[6]:
0                   2NE1_zh.wikipedia.org_all-access_spider
1                    2PM_zh.wikipedia.org_all-access_spider
2                     3C_zh.wikipedia.org_all-access_spider
3                4minute_zh.wikipedia.org_all-access_spider
4         52_Hz_I_Love_You_zh.wikipedia.org_all-access_s...
5                   5566_zh.wikipedia.org_all-access_spider
6                 91Days_zh.wikipedia.org_all-access_spider
7                  A'N'D_zh.wikipedia.org_all-access_spider
8                  AKB48_zh.wikipedia.org_all-access_spider
9                  ASCII_zh.wikipedia.org_all-access_spider
10                 ASTRO_zh.wikipedia.org_all-access_spider
11        Ahq_e-Sports_Club_zh.wikipedia.org_all-access_...
12        All_your_base_are_belong_to_us_zh.wikipedia.or...
13               AlphaGo_zh.wikipedia.org_all-access_spider
14               Android_zh.wikipedia.org_all-access_spider
15            Angelababy_zh.wikipedia.org_all-access_spider
16                 Apink_zh.wikipedia.org_all-access_spider
17              Apple_II_zh.wikipedia.org_all-access_spider
18                As_One_zh.wikipedia.org_all-access_spider
19             B-PROJECT_zh.wikipedia.org_all-access_spider
20                  B1A4_zh.wikipedia.org_all-access_spider
21                  BDSM_zh.wikipedia.org_all-access_spider
22                 BEAST_zh.wikipedia.org_all-access_spider
23               BIGBANG_zh.wikipedia.org_all-access_spider
24            BLACK_PINK_zh.wikipedia.org_all-access_spider
25                BLEACH_zh.wikipedia.org_all-access_spider
26                  BTOB_zh.wikipedia.org_all-access_spider
27        Beautiful_Mind_zh.wikipedia.org_all-access_spider
28                Beyond_zh.wikipedia.org_all-access_spider
29                   Big_zh.wikipedia.org_all-access_spider
                                ...                        
145033    Sin_senos_sí_hay_paraíso_es.wikipedia.org_all-...
145034    Anexo:Medallero_de_los_Juegos_Olímpicos_de_Río...
145035    Arrival_(película)_es.wikipedia.org_all-access...
145036    Anexo:Baloncesto_en_los_Juegos_Olímpicos_de_Rí...
145037    Hasta_que_te_conocí_(serie_de_televisión)_es.w...
145038    Westworld_(serie_de_televisión)_es.wikipedia.o...
145039          Milénico_es.wikipedia.org_all-access_spider
145040    Moonlight_(película)_es.wikipedia.org_all-acce...
145041    Sully_(película)_es.wikipedia.org_all-access_s...
145042    Pulsaciones_(serie_de_televisión)_es.wikipedia...
145043    2091_(serie_de_televisión)_es.wikipedia.org_al...
145044    Campeonato_Sudamericano_de_Fútbol_Sub-20_de_20...
145045    Split_(película)_es.wikipedia.org_all-access_s...
145046    Huracán_Matthew_es.wikipedia.org_all-access_sp...
145047    Fences_(película)_es.wikipedia.org_all-access_...
145048    Logan_(película)_es.wikipedia.org_all-access_s...
145049    La_doña_(telenovela_de_2016)_es.wikipedia.org_...
145050    RTS_(canal_de_televisión)_es.wikipedia.org_all...
145051    La_ley_del_corazón_es.wikipedia.org_all-access...
145052    The_Crown_(serie_de_televisión)_es.wikipedia.o...
145053    Drake_(músico)_es.wikipedia.org_all-access_spider
145054    Skam_(serie_de_televisión)_es.wikipedia.org_al...
145055    Legión_(serie_de_televisión)_es.wikipedia.org_...
145056    Doble_tentación_es.wikipedia.org_all-access_sp...
145057    Mi_adorable_maldición_es.wikipedia.org_all-acc...
145058    Underworld_(serie_de_películas)_es.wikipedia.o...
145059    Resident_Evil:_Capítulo_Final_es.wikipedia.org...
145060    Enamorándome_de_Ramón_es.wikipedia.org_all-acc...
145061    Hasta_el_último_hombre_es.wikipedia.org_all-ac...
145062    Francisco_el_matemático_(serie_de_televisión_d...
Name: Page, Length: 145063, dtype: object
In [7]:
train['Page'][0:5]
Out[7]:
0              2NE1_zh.wikipedia.org_all-access_spider
1               2PM_zh.wikipedia.org_all-access_spider
2                3C_zh.wikipedia.org_all-access_spider
3           4minute_zh.wikipedia.org_all-access_spider
4    52_Hz_I_Love_You_zh.wikipedia.org_all-access_s...
Name: Page, dtype: object

1.4 Missing values

In [9]:
sum(train.isnull().sum())/ (train.shape[1]+train.shape[0])
Out[9]:
42.52977735657286
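
The ratio above divides the number of missing cells by the sum of the two dimensions, so it is not a fraction of the data. Below is a minimal sketch of the share of missing cells, assuming the intent is "missing cells divided by all cells". The helper name `missing_fraction` and the toy frame are hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd

def missing_fraction(df):
    """Share of cells in df that are NaN (a value between 0 and 1)."""
    n_cells = df.shape[0] * df.shape[1]   # rows times columns, not rows plus columns
    return df.isnull().sum().sum() / n_cells

# Toy frame: 2 missing values out of 6 cells
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, 5.0, 6.0]})
print(missing_fraction(toy))
```

Calling `missing_fraction(train)` would give the corresponding share for the training data.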

2 Data transformation and helper functions

2.1 Article names and metadata

To make the training data easier to handle we split it into two parts: the article information (from the Page column) and the time series data (tdates) from the date columns. We briefly separate the article information into data from wikipedia, wikimedia, and mediawiki because the Page names are formatted differently. After that, we rejoin all article information into a common data set (tpages).

In [10]:
tdates= train.iloc[:,1:]

articles= train["Page"]
mediawiki= articles[articles.str.contains("mediawiki")]
wikimedia = articles[articles.str.contains("wikimedia")]
wikipedia = articles[articles.str.contains("wikipedia")]

wikipedia=wikipedia.to_frame()
mediawiki=mediawiki.to_frame()
wikimedia= wikimedia.to_frame()
In [11]:
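def fil(x):
    # True only for page names that mention neither "mediawiki" nor "wikimedia"
    return "mediawiki" not in x and "wikimedia" not in x

# Keep the matching rows; after reset_index the index counts wikipedia rows only
wikipedia=wikipedia[wikipedia["Page"].apply(fil)]
wikipedia=wikipedia.reset_index(drop=True)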
In [12]:
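def splitF(x):
    # Return the part after ".wikipedia.org_" (access and agent),
    # or "None" when the page name does not contain that separator.
    try:
        a1= x.split(".wikipedia.org_")
        return a1[1]
    except (IndexError, AttributeError):
        return "None"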
In [13]:
wikipedia["foo"]= wikipedia["Page"].apply(lambda x: x.split(".wikipedia.org_")[0])
wikipedia["bar"]=wikipedia["Page"].apply(splitF)
In [14]:
wikipedia
Out[14]:
Page foo bar
0 2NE1_zh.wikipedia.org_all-access_spider 2NE1_zh all-access_spider
1 2PM_zh.wikipedia.org_all-access_spider 2PM_zh all-access_spider
2 3C_zh.wikipedia.org_all-access_spider 3C_zh all-access_spider
3 4minute_zh.wikipedia.org_all-access_spider 4minute_zh all-access_spider
4 52_Hz_I_Love_You_zh.wikipedia.org_all-access_s... 52_Hz_I_Love_You_zh all-access_spider
5 5566_zh.wikipedia.org_all-access_spider 5566_zh all-access_spider
6 91Days_zh.wikipedia.org_all-access_spider 91Days_zh all-access_spider
7 A'N'D_zh.wikipedia.org_all-access_spider A'N'D_zh all-access_spider
8 AKB48_zh.wikipedia.org_all-access_spider AKB48_zh all-access_spider
9 ASCII_zh.wikipedia.org_all-access_spider ASCII_zh all-access_spider
10 ASTRO_zh.wikipedia.org_all-access_spider ASTRO_zh all-access_spider
11 Ahq_e-Sports_Club_zh.wikipedia.org_all-access_... Ahq_e-Sports_Club_zh all-access_spider
12 All_your_base_are_belong_to_us_zh.wikipedia.or... All_your_base_are_belong_to_us_zh all-access_spider
13 AlphaGo_zh.wikipedia.org_all-access_spider AlphaGo_zh all-access_spider
14 Android_zh.wikipedia.org_all-access_spider Android_zh all-access_spider
15 Angelababy_zh.wikipedia.org_all-access_spider Angelababy_zh all-access_spider
16 Apink_zh.wikipedia.org_all-access_spider Apink_zh all-access_spider
17 Apple_II_zh.wikipedia.org_all-access_spider Apple_II_zh all-access_spider
18 As_One_zh.wikipedia.org_all-access_spider As_One_zh all-access_spider
19 B-PROJECT_zh.wikipedia.org_all-access_spider B-PROJECT_zh all-access_spider
20 B1A4_zh.wikipedia.org_all-access_spider B1A4_zh all-access_spider
21 BDSM_zh.wikipedia.org_all-access_spider BDSM_zh all-access_spider
22 BEAST_zh.wikipedia.org_all-access_spider BEAST_zh all-access_spider
23 BIGBANG_zh.wikipedia.org_all-access_spider BIGBANG_zh all-access_spider
24 BLACK_PINK_zh.wikipedia.org_all-access_spider BLACK_PINK_zh all-access_spider
25 BLEACH_zh.wikipedia.org_all-access_spider BLEACH_zh all-access_spider
26 BTOB_zh.wikipedia.org_all-access_spider BTOB_zh all-access_spider
27 Beautiful_Mind_zh.wikipedia.org_all-access_spider Beautiful_Mind_zh all-access_spider
28 Beyond_zh.wikipedia.org_all-access_spider Beyond_zh all-access_spider
29 Big_zh.wikipedia.org_all-access_spider Big_zh all-access_spider
... ... ... ...
127185 Sin_senos_sí_hay_paraíso_es.wikipedia.org_all-... Sin_senos_sí_hay_paraíso_es all-access_spider
127186 Anexo:Medallero_de_los_Juegos_Olímpicos_de_Río... Anexo:Medallero_de_los_Juegos_Olímpicos_de_Río... all-access_spider
127187 Arrival_(película)_es.wikipedia.org_all-access... Arrival_(película)_es all-access_spider
127188 Anexo:Baloncesto_en_los_Juegos_Olímpicos_de_Rí... Anexo:Baloncesto_en_los_Juegos_Olímpicos_de_Rí... all-access_spider
127189 Hasta_que_te_conocí_(serie_de_televisión)_es.w... Hasta_que_te_conocí_(serie_de_televisión)_es all-access_spider
127190 Westworld_(serie_de_televisión)_es.wikipedia.o... Westworld_(serie_de_televisión)_es all-access_spider
127191 Milénico_es.wikipedia.org_all-access_spider Milénico_es all-access_spider
127192 Moonlight_(película)_es.wikipedia.org_all-acce... Moonlight_(película)_es all-access_spider
127193 Sully_(película)_es.wikipedia.org_all-access_s... Sully_(película)_es all-access_spider
127194 Pulsaciones_(serie_de_televisión)_es.wikipedia... Pulsaciones_(serie_de_televisión)_es all-access_spider
127195 2091_(serie_de_televisión)_es.wikipedia.org_al... 2091_(serie_de_televisión)_es all-access_spider
127196 Campeonato_Sudamericano_de_Fútbol_Sub-20_de_20... Campeonato_Sudamericano_de_Fútbol_Sub-20_de_20... all-access_spider
127197 Split_(película)_es.wikipedia.org_all-access_s... Split_(película)_es all-access_spider
127198 Huracán_Matthew_es.wikipedia.org_all-access_sp... Huracán_Matthew_es all-access_spider
127199 Fences_(película)_es.wikipedia.org_all-access_... Fences_(película)_es all-access_spider
127200 Logan_(película)_es.wikipedia.org_all-access_s... Logan_(película)_es all-access_spider
127201 La_doña_(telenovela_de_2016)_es.wikipedia.org_... La_doña_(telenovela_de_2016)_es all-access_spider
127202 RTS_(canal_de_televisión)_es.wikipedia.org_all... RTS_(canal_de_televisión)_es all-access_spider
127203 La_ley_del_corazón_es.wikipedia.org_all-access... La_ley_del_corazón_es all-access_spider
127204 The_Crown_(serie_de_televisión)_es.wikipedia.o... The_Crown_(serie_de_televisión)_es all-access_spider
127205 Drake_(músico)_es.wikipedia.org_all-access_spider Drake_(músico)_es all-access_spider
127206 Skam_(serie_de_televisión)_es.wikipedia.org_al... Skam_(serie_de_televisión)_es all-access_spider
127207 Legión_(serie_de_televisión)_es.wikipedia.org_... Legión_(serie_de_televisión)_es all-access_spider
127208 Doble_tentación_es.wikipedia.org_all-access_sp... Doble_tentación_es all-access_spider
127209 Mi_adorable_maldición_es.wikipedia.org_all-acc... Mi_adorable_maldición_es all-access_spider
127210 Underworld_(serie_de_películas)_es.wikipedia.o... Underworld_(serie_de_películas)_es all-access_spider
127211 Resident_Evil:_Capítulo_Final_es.wikipedia.org... Resident_Evil:_Capítulo_Final_es all-access_spider
127212 Enamorándome_de_Ramón_es.wikipedia.org_all-acc... Enamorándome_de_Ramón_es all-access_spider
127213 Hasta_el_último_hombre_es.wikipedia.org_all-ac... Hasta_el_último_hombre_es all-access_spider
127214 Francisco_el_matemático_(serie_de_televisión_d... Francisco_el_matemático_(serie_de_televisión_d... all-access_spider

127215 rows × 3 columns

In [15]:
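def splitA(x):
    # Return the agent (second "_"-separated field), or "None" if it is missing.
    try:
        a1= x.split("_")
        return a1[1]
    except (IndexError, AttributeError):
        return "None"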
wikipedia["rowname"] = wikipedia.index 
wikipedia["article"]=wikipedia["foo"].apply(lambda x: x[0:-3])
wikipedia["locale"]= wikipedia["foo"].apply(lambda x: x[-2:])
wikipedia["access"]=wikipedia["bar"].apply(lambda x: x.split("_")[0])
wikipedia["agent"]=wikipedia["bar"].apply(splitA)
wikipedia.drop(["Page","foo","bar"] ,axis=1, inplace=True)
In [16]:
wikimedia["rowname"] = wikimedia.index 
wikimedia["article"]= wikimedia["Page"].apply(lambda x: x.split("_commons.wikimedia.org_")[0])
wikimedia["bar"]=wikimedia["Page"].apply(lambda x: x.split("_commons.wikimedia.org_")[1])
wikimedia["access"]=wikimedia["bar"].apply(lambda x: x.split("_")[0])
wikimedia["agent"]=wikimedia["bar"].apply(lambda x: x.split("_")[1])
wikimedia["locale"]="wikmed"
wikimedia.drop(["Page","bar"],axis=1, inplace=True)
In [17]:
mediawiki["rowname"] = mediawiki.index 
mediawiki["article"]= mediawiki["Page"].apply(lambda x: x.split("_www.mediawiki.org_")[0])
mediawiki["bar"]=mediawiki["Page"].apply(lambda x: x.split("_www.mediawiki.org_")[1])
mediawiki["access"]=mediawiki["bar"].apply(lambda x: x.split("_")[0])
mediawiki["agent"]=mediawiki["bar"].apply(lambda x: x.split("_")[1])
mediawiki["locale"]="medwik"
mediawiki.drop(["Page","bar"],axis=1, inplace=True)
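
The three cells above slice the Page string by hand for each project. As an alternative, here is a sketch that parses all three layouts with one regular expression. It assumes every name follows `<article>_<project>_<access>_<agent>`. The pattern, the `project` column and the two commons/mediawiki example names are illustrative assumptions, not taken from the data.

```python
import pandas as pd

# Assumed layout: <article>_<project>_<access>_<agent>, where project is
# e.g. zh.wikipedia.org, commons.wikimedia.org or www.mediawiki.org
PAGE_PATTERN = (
    r"^(?P<article>.+)_"
    r"(?P<project>[a-z]+\.(?:wikipedia|wikimedia|mediawiki)\.org)_"
    r"(?P<access>[^_]+)_"
    r"(?P<agent>[^_]+)$"
)

def parse_pages(pages):
    """Split page names into article, project, access and agent columns."""
    return pages.str.extract(PAGE_PATTERN)

pages = pd.Series([
    "2NE1_zh.wikipedia.org_all-access_spider",
    "52_Hz_I_Love_You_zh.wikipedia.org_all-access_spider",
    "Some_File_commons.wikimedia.org_desktop_all-agents",      # hypothetical name
    "Some_Help_Page_www.mediawiki.org_mobile-web_all-agents",  # hypothetical name
])
parsed = parse_pages(pages)
print(parsed)
```

Names that do not match the pattern come back as NaN in every column, which makes malformed entries easy to spot.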
In [18]:
wikipedia
Out[18]:
rowname article locale access agent
0 0 2NE1 zh all-access spider
1 1 2PM zh all-access spider
2 2 3C zh all-access spider
3 3 4minute zh all-access spider
4 4 52_Hz_I_Love_You zh all-access spider
5 5 5566 zh all-access spider
6 6 91Days zh all-access spider
7 7 A'N'D zh all-access spider
8 8 AKB48 zh all-access spider
9 9 ASCII zh all-access spider
10 10 ASTRO zh all-access spider
11 11 Ahq_e-Sports_Club zh all-access spider
12 12 All_your_base_are_belong_to_us zh all-access spider
13 13 AlphaGo zh all-access spider
14 14 Android zh all-access spider
15 15 Angelababy zh all-access spider
16 16 Apink zh all-access spider
17 17 Apple_II zh all-access spider
18 18 As_One zh all-access spider
19 19 B-PROJECT zh all-access spider
20 20 B1A4 zh all-access spider
21 21 BDSM zh all-access spider
22 22 BEAST zh all-access spider
23 23 BIGBANG zh all-access spider
24 24 BLACK_PINK zh all-access spider
25 25 BLEACH zh all-access spider
26 26 BTOB zh all-access spider
27 27 Beautiful_Mind zh all-access spider
28 28 Beyond zh all-access spider
29 29 Big zh all-access spider
... ... ... ... ... ...
127185 127185 Sin_senos_sí_hay_paraíso es all-access spider
127186 127186 Anexo:Medallero_de_los_Juegos_Olímpicos_de_Río... es all-access spider
127187 127187 Arrival_(película) es all-access spider
127188 127188 Anexo:Baloncesto_en_los_Juegos_Olímpicos_de_Rí... es all-access spider
127189 127189 Hasta_que_te_conocí_(serie_de_televisión) es all-access spider
127190 127190 Westworld_(serie_de_televisión) es all-access spider
127191 127191 Milénico es all-access spider
127192 127192 Moonlight_(película) es all-access spider
127193 127193 Sully_(película) es all-access spider
127194 127194 Pulsaciones_(serie_de_televisión) es all-access spider
127195 127195 2091_(serie_de_televisión) es all-access spider
127196 127196 Campeonato_Sudamericano_de_Fútbol_Sub-20_de_2017 es all-access spider
127197 127197 Split_(película) es all-access spider
127198 127198 Huracán_Matthew es all-access spider
127199 127199 Fences_(película) es all-access spider
127200 127200 Logan_(película) es all-access spider
127201 127201 La_doña_(telenovela_de_2016) es all-access spider
127202 127202 RTS_(canal_de_televisión) es all-access spider
127203 127203 La_ley_del_corazón es all-access spider
127204 127204 The_Crown_(serie_de_televisión) es all-access spider
127205 127205 Drake_(músico) es all-access spider
127206 127206 Skam_(serie_de_televisión) es all-access spider
127207 127207 Legión_(serie_de_televisión) es all-access spider
127208 127208 Doble_tentación es all-access spider
127209 127209 Mi_adorable_maldición es all-access spider
127210 127210 Underworld_(serie_de_películas) es all-access spider
127211 127211 Resident_Evil:_Capítulo_Final es all-access spider
127212 127212 Enamorándome_de_Ramón es all-access spider
127213 127213 Hasta_el_último_hombre es all-access spider
127214 127214 Francisco_el_matemático_(serie_de_televisión_d... es all-access spider

127215 rows × 5 columns

In [19]:
frames=[wikipedia,wikimedia,mediawiki]
# The three frames list their columns in different orders; sort=True keeps the alphabetical column order
tpages=pd.concat(frames,ignore_index=True,sort=True)

2.2 Time series extraction

In order to plot the time series data we use a helper function that allows us to extract the time series for a specified row number. (The normalised version facilitates the comparison between multiple time series curves by correcting for large differences in view count.)

In [20]:
def extract_ts(rownr):
    # Return a fresh frame on every call so that results for different rows
    # never share (and overwrite) the same object.
    return pd.DataFrame({"dates": tdates.columns.values,
                         "views": tdates.values[rownr,:]})
       
In [21]:
def extract_ts_nrm(rownr):
    # Views divided by the row mean; a fresh frame is returned on every call.
    views= np.array(tdates.values[rownr,:])
    return pd.DataFrame({"dates": tdates.columns.values,
                         "views": views/np.mean(views)})
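
np.mean returns NaN as soon as a row contains a missing value, so the normalised series of such a page is NaN throughout. Below is a sketch of a NaN-tolerant variant, assuming missing days should simply be skipped when computing the mean. `normalise_views` is a hypothetical helper, not part of the notebook.

```python
import numpy as np

def normalise_views(views):
    """Divide a view-count array by its mean, ignoring NaN entries in the mean."""
    views = np.asarray(views, dtype=float)
    mean = np.nanmean(views)   # mean over the non-missing days only
    return views / mean        # missing days stay NaN in the result

toy = np.array([np.nan, 10.0, 20.0, 30.0])
print(normalise_views(toy))
```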

A custom-made plotting function allows us to visualise each time series and extract its metadata.

In [22]:
def plot_rownr(rownr):
    art=tpages['article'][rownr]
    loc=tpages['locale'][rownr]
    acc=tpages['access'][rownr]
    dateView=extract_ts(rownr)
    #return dateView
    dateView["dates"] = pd.to_datetime(dateView['dates'])
    return ggplot(dateView,aes(x='dates',y='views'))+ geom_line()+geom_smooth(color = "blue", span = 1/5)+ stat_smooth(method = "lm") +labs(title = art+"-"+loc+"-"+acc)
    

This is how it works (to visualise timey-wimey stuff):

In [23]:
plot_rownr(1)
C:\Users\Admin\Anaconda3\lib\site-packages\pandas\core\generic.py:4384: FutureWarning: Attribute 'is_copy' is deprecated and will be removed in a future version.
  object.__getattribute__(self, name)
C:\Users\Admin\Anaconda3\lib\site-packages\pandas\core\generic.py:4385: FutureWarning: Attribute 'is_copy' is deprecated and will be removed in a future version.
  return object.__setattr__(self, name, value)
Out[23]:
<ggplot: (190650717476)>
In [24]:
def plot_rownr_log(rownr):
    art=tpages['article'][rownr]
    loc=tpages['locale'][rownr]
    acc=tpages['access'][rownr]
    dateView=extract_ts_nrm(rownr)
    dateView["dates"] = pd.to_datetime(dateView['dates'])
    # Build the whole plot in one expression; anything placed after `return` never runs
    return (ggplot(dateView,aes(x='dates',y='views'))
            + geom_line()
            + geom_smooth(color = "blue", span = 1/5)
            + stat_smooth(method = "lm")
            + labs(title = art+"-"+loc+"-"+acc, y = "log views")
            + scale_y_log10())
    
In [25]:
plot_rownr_log(1)
Out[25]:
<ggplot: (190695121451)>
In [26]:
def plot_rownr_zoom(rownr, start, end):
    art=tpages['article'][rownr]
    loc=tpages['locale'][rownr]
    acc=tpages['access'][rownr]
    dateView=extract_ts_nrm(rownr)
    dateView["dates"] = pd.to_datetime(dateView['dates'])
    startDate=datetime.strptime(start , '%Y-%m-%d')
    endDate=datetime.strptime(end , '%Y-%m-%d')
    dateView=dateView[(dateView["dates"]>=startDate) & (dateView["dates"]<=endDate) ]
    #return dateView
    return ggplot(dateView,aes(x='dates',y='views'))+ geom_line()+geom_smooth(color = "blue", span = 1/5)+ stat_smooth(method = "lm") +labs(title = art+"-"+loc+"-"+acc)
In [27]:
plot_rownr_zoom(1,'2015-03-01','2015-09-01')
Out[27]:
<ggplot: (190695165858)>

In addition, with the help of the extractor tool we define a function that re-connects the Page information to the corresponding time series and plots this curve according to our specification on article name, access type, and agent for all the available languages:

In [28]:
def plot_names(art, acc, ag):
    pick=tpages[(tpages['article']==art) & (tpages['access']==acc) & (tpages['agent']==ag)]
    pick_nr=pick['rowname']
    pick_loc=pick['locale']
    # .copy() keeps each extracted series independent of the next extraction
    tdat= extract_ts(pick_nr.values[0]).copy()
    tdat["loc"]=pick_loc.values[0]
    for i in range(1,len(pick)):
        foo= extract_ts(pick_nr.values[i]).copy()
        foo["loc"]=pick_loc.values[i]
        tdat=pd.concat([tdat,foo])
    tdat["dates"] = pd.to_datetime(tdat['dates'])
    # Named `p` so that it does not shadow the matplotlib `plt` import
    p=ggplot(tdat,aes(x='dates',y='views',color = 'loc'))+geom_line() + labs(title = art+"-"+acc+"-"+ag)
    return p
def plot_names_nrm(art, acc, ag):
    pick=tpages[(tpages['article']==art) & (tpages['access']==acc) & (tpages['agent']==ag)]
    pick_nr=pick['rowname']
    pick_loc=pick['locale']
    tdat= extract_ts_nrm(pick_nr.values[0]).copy()
    tdat["loc"]=pick_loc.values[0]
    for i in range(1,len(pick)):
        # Use the normalised extractor for every language, not only the first one
        foo= extract_ts_nrm(pick_nr.values[i]).copy()
        foo["loc"]=pick_loc.values[i]
        tdat=pd.concat([tdat,foo])
    tdat["dates"] = pd.to_datetime(tdat['dates'])
    p=ggplot(tdat,aes(x='dates',y='views',color = 'loc'))+geom_line() + labs(title = art+"-"+acc+"-"+ag)
    return p
In [29]:
plot_names("The_Beatles", "all-access", "all-agents")
C:\Users\Admin\Anaconda3\lib\site-packages\plotnine\utils.py:281: FutureWarning: Method .as_matrix will be removed in a future version. Use .values instead.
  ndistinct = ids.apply(len_unique, axis=0).as_matrix()
Out[29]:
<ggplot: (-9223371846250649172)>

These are the tools we need for a visual examination of arbitrary individual time series data. In the following, we will use them to illustrate specific observations that are of particular interest.

3 Summary parameter extraction

In the next step we will have a more global look at the population parameters of our training time series data. Also here, we will start with the wikipedia data. The idea behind this approach is to probe the parameter space of the time series information along certain key metrics and to identify extreme observations that could break our forecasting strategies.

3.1 Projects data overview

Before diving into the time series data let’s have a look at how the different meta-parameters are distributed:

In [30]:
ggplot(tpages, aes(x='agent')) + geom_bar(fill = "red")
C:\Users\Admin\Anaconda3\lib\site-packages\plotnine\positions\position.py:188: FutureWarning: Method .as_matrix will be removed in a future version. Use .values instead.
  intervals = data[xminmax].drop_duplicates().as_matrix().flatten()
Out[30]:
<ggplot: (190648642119)>
In [31]:
ggplot(tpages, aes(x='access')) + geom_bar(fill = "red")
Out[31]:
<ggplot: (190648666309)>
In [32]:
ggplot(tpages, aes(x='locale', fill=tpages['locale'])) + geom_bar()
Out[32]:
<ggplot: (190648600703)>

We find that our wikipedia data includes 7 languages: German, English, Spanish, French, Japanese, Russian, and Chinese. All of those are more frequent than the mediawiki and wikimedia pages. Mobile sites are slightly more frequent than desktop ones.

3.2 Basic time series parameters

We start with a basic set of parameters: mean, standard deviation, amplitude, and the slope of a naive linear fit. This is our extraction function:

In [33]:
def params_ts1(rownr):
    dateView=extract_ts(rownr)
    dateView["dates"] = pd.to_datetime(dateView['dates'])
    y=dateView["views"]
    x= dateView[["dates"]]
    model= lm.LinearRegression()
    model.fit(x,y)
    # Return a fresh one-row frame instead of mutating a shared global frame
    return pd.DataFrame({
        'rowname': [rownr],
        'slope': [model.coef_[0]],
        'min_view': [np.min(dateView["views"])],
        'max_view': [np.max(dateView["views"])],
        'mean_view': [np.mean(dateView["views"])],
        'med_view': [np.median(dateView["views"])],
        'sd_view': [np.std(dateView["views"])],
    })

And here we run it. (Note that in this kernel version I’m currently using a sub-sample of the data for reasons of runtime. My extractor function is not very elegant yet, and it exceeds the kernel runtime for the complete data set.)

In [34]:
x=np.random.choice(tpages["rowname"], size=5500)
x.astype(object)
Out[34]:
array([83779, 90203, 102827, ..., 63937, 31909, 122732], dtype=object)
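
np.random.choice draws with replacement by default, so the same row can appear more than once in the sample. Below is a sketch of a reproducible draw without replacement, assuming duplicates are not wanted. The helper name, the seed value and the toy row numbers are arbitrary choices for illustration.

```python
import numpy as np

def sample_rows(rownames, size, seed=0):
    """Draw `size` distinct row numbers; the same seed gives the same sample."""
    rng = np.random.default_rng(seed)
    return rng.choice(np.asarray(rownames), size=size, replace=False)

toy_rownames = np.arange(100)
picked = sample_rows(toy_rownames, size=10)
print(len(picked), len(set(picked.tolist())))
```

With `replace=False` the requested size must not exceed the number of available rows.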
In [35]:
joinedParam= pd.DataFrame(columns=['rowname','slope','min_view','max_view','mean_view','med_view','sd_view'])
for i in x:
    dateView=extract_ts(i)
    dateView["dates"] = pd.to_datetime(dateView['dates']) 
    if not dateView["views"].isnull().values.any():
        joinedParam=pd.concat([joinedParam,params_ts1(i)])
joinedParam.index=joinedParam["rowname"]
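
The row-by-row loop refits a regression and concatenates a frame on every iteration, which is where the runtime goes. Below is a sketch of a vectorised alternative, under two assumptions: the date columns are consecutive days, and a slope in views per day (rather than per timestamp unit, as in the fit above) is acceptable. `params_all` is a hypothetical helper, not the notebook's method.

```python
import numpy as np
import pandas as pd

def params_all(tdates):
    """Per-row summary statistics and least-squares slope (views per day)."""
    views = tdates.to_numpy(dtype=float)
    t = np.arange(views.shape[1], dtype=float)   # day index 0, 1, 2, ...
    t_c = t - t.mean()
    mean = views.mean(axis=1)
    # OLS slope per row: sum((t - t_mean) * (v - v_mean)) / sum((t - t_mean)^2)
    slope = ((views - mean[:, None]) @ t_c) / (t_c @ t_c)
    return pd.DataFrame({
        "rowname": tdates.index,
        "slope": slope,
        "min_view": views.min(axis=1),
        "max_view": views.max(axis=1),
        "mean_view": mean,
        "med_view": np.median(views, axis=1),
        "sd_view": views.std(axis=1),
    })

# Toy data: one steadily rising series and one flat series
toy = pd.DataFrame([[1.0, 2.0, 3.0, 4.0], [5.0, 5.0, 5.0, 5.0]],
                   columns=["2015-07-01", "2015-07-02", "2015-07-03", "2015-07-04"])
toy_params = params_all(toy)
print(toy_params)
```

Rows that contain missing values come out as NaN here; dropping them with `dropna()` would mirror the loop, which skips such rows.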

3.3 Overview visualisations

Let’s explore the parameter space we’ve built. (The global shape of the distributions should not be affected by the sampling.) First we plot the histograms of our main parameters:

In [36]:
ggplot(joinedParam, aes(x='mean_view'))+geom_histogram(fill = "red", bins = 50) + scale_x_log10()
Out[36]:
<ggplot: (-9223371846206220905)>
In [37]:
ggplot(joinedParam, aes(x='med_view'))+geom_histogram(fill = "red", bins = 50) + scale_x_log10()
C:\Users\Admin\Anaconda3\lib\site-packages\plotnine\scales\scale.py:516: RuntimeWarning: divide by zero encountered in log10
  return self.trans.transform(x)
C:\Users\Admin\Anaconda3\lib\site-packages\plotnine\layer.py:363: UserWarning: stat_bin : Removed 4 rows containing non-finite values.
  data = self.stat.compute_layer(data, params, layout)
Out[37]:
<ggplot: (190598744626)>
In [38]:
difsdmean= pd.DataFrame(columns=["sd/mean"])
difsdmean["sd/mean"]=np.array(joinedParam["sd_view"])/np.array(joinedParam["mean_view"])
ggplot(difsdmean, aes(x='sd/mean'))+geom_histogram(fill = "red", bins = 50) + scale_x_log10()
Out[38]:
<ggplot: (190652485994)>
In [39]:
# Slopes can be negative, and a log axis would drop those rows, so the x axis stays linear here
ggplot(joinedParam, aes(x='slope'))+geom_histogram(fill = "red", bins = 50)
Out[39]:
<ggplot: (-9223371846202273234)>

We find:

  • The distribution of average views is clearly bimodal, with peaks around 10 and 200-300 views. Something similar is true for the number of maximum views, although here the first peak (around 200) is curiously narrow. The second peak is centred above 10,000.

  • The distribution of standard deviations (divided by the mean) is skewed toward higher values with larger numbers of spikes or stronger variability trends. Those will be the observations that are more challenging to forecast.

  • The slope distribution is reasonably symmetric and centred notably above zero.
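
The sd/mean ratio in the second bullet is the coefficient of variation. Here is a toy illustration of why spiky series score higher on it, using made-up numbers rather than values from the data set. The function name is a hypothetical helper.

```python
import numpy as np

def coeff_of_variation(views):
    """Standard deviation divided by the mean of a view-count series."""
    views = np.asarray(views, dtype=float)
    return np.std(views) / np.mean(views)

steady = [100, 102, 98, 101, 99]   # small day-to-day fluctuations
spiky = [10, 10, 10, 10, 460]      # same mean, but one large spike
print(coeff_of_variation(steady), coeff_of_variation(spiky))
```

Both toy series have a mean of 100 views, yet the single spike pushes the ratio above 1, while the steady series stays close to 0.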

In [40]:
#par_page= tpages.join(joinedParam,on="rowname",lsuffix="l",rsuffix="r")
#par_page
par_page= tpages.merge(joinedParam,on="rowname")
par_page
C:\Users\Admin\Anaconda3\lib\site-packages\ipykernel_launcher.py:3: FutureWarning: 'rowname' is both an index level and a column label.
Defaulting to column, but this will raise an ambiguity error in a future version
  This is separate from the ipykernel package so we can avoid doing imports until
Out[40]:
access agent article locale rowname max_view min_view med_view sd_view slope mean_view
0 all-access spider AKB48 zh 8 203.0 5.0 31.0 21.105018 4.012260e-16 35.007273
1 all-access spider I'm_Home zh 72 67.0 0.0 7.0 5.832916 6.407526e-17 7.838182
2 all-access spider SHINee zh 130 187.0 2.0 17.0 21.211062 2.921021e-16 24.134545
3 all-access spider Super_Junior zh 140 192.0 3.0 22.0 13.803907 2.207308e-16 24.867273
4 all-access spider X_Japan zh 170 179.0 0.0 9.0 13.030479 2.887371e-16 12.254545
5 all-access spider 佛教 zh 203 419.0 7.0 31.0 27.616592 2.321988e-16 34.698182
6 all-access spider 佛教 zh 203 419.0 7.0 31.0 27.616592 2.321988e-16 34.698182
7 all-access spider 一個中國_(中華人民共和國) zh 228 742.0 0.0 6.0 44.612766 -1.691716e-17 11.792727
8 all-access spider 王嘉爾 zh 268 342.0 1.0 16.5 31.245527 6.282544e-16 26.078182
9 all-access spider 崔宇植 zh 305 168.0 0.0 6.0 10.073253 8.956871e-17 8.389091
10 all-access spider 冰與火之歌 zh 318 104.0 1.0 16.0 11.104411 3.000068e-16 18.069091
11 all-access spider 商鞅 zh 328 152.0 1.0 13.0 16.535554 1.821560e-16 17.861818
12 all-access spider 迪丽热巴 zh 369 84.0 0.0 8.0 6.389579 9.938550e-17 8.703636
13 all-access spider 流感_(電影) zh 370 26.0 0.0 5.0 3.787550 1.479920e-16 5.774545
14 all-access spider 王楠 zh 402 29.0 0.0 5.0 3.966231 1.255535e-16 6.225455
15 all-access spider 三民主義 zh 449 370.0 4.0 31.0 48.289658 -3.026527e-16 44.449091
16 all-access spider 妖夜尋狼 zh 474 118.0 0.0 5.0 8.284229 2.519108e-16 7.025455
17 all-access spider 香港電子競技戰隊 zh 508 752.0 0.0 8.0 43.435601 -2.589603e-16 16.576364
18 all-access spider 黄渤 zh 539 225.0 1.0 11.0 11.240338 8.597282e-17 12.554545
19 all-access spider 鲍勃·迪伦 zh 562 519.0 0.0 9.0 25.855140 4.009663e-16 12.458182
20 all-access spider 愛黛兒 zh 564 169.0 1.0 13.0 11.639010 3.766896e-17 14.798182
21 all-access spider 亚伯拉罕·林肯 zh 576 210.0 1.0 16.0 13.022846 2.245892e-16 17.994545
22 all-access spider 珀斯 zh 582 181.0 1.0 10.0 10.566097 1.065097e-16 11.481818
23 all-access spider 張晉 zh 591 126.0 1.0 8.0 9.374638 -2.553769e-17 9.787273
24 all-access spider 還珠格格 zh 678 156.0 1.0 12.0 11.935477 1.264613e-16 13.827273
25 all-access spider 沈富雄 zh 708 56.0 0.0 7.0 6.110675 5.828386e-17 8.523636
26 all-access spider 監獄學園 zh 710 218.0 0.0 15.0 20.737254 -1.021466e-16 19.541818
27 all-access spider 張善為 zh 723 100.0 0.0 7.0 6.197215 8.217241e-17 7.567273
28 all-access spider 崔泰俊 zh 740 89.0 0.0 7.0 8.508525 3.218464e-16 8.943636
29 all-access spider 亞人_(漫畫) zh 748 207.0 0.0 10.0 15.531769 3.316644e-16 13.812727
... ... ... ... ... ... ... ... ... ... ... ...
4859 all-access spider Muro_de_Berlín es 126343 258.0 0.0 3.0 17.697821 1.480871e-16 4.905455
4860 all-access spider Bulbo_raquídeo es 126367 69.0 1.0 9.0 10.852134 2.604558e-16 12.534545
4861 all-access spider Cien_años_de_soledad es 126379 668.0 5.0 34.0 30.481136 1.499588e-16 35.196364
4862 all-access spider Cien_años_de_soledad es 126379 668.0 5.0 34.0 30.481136 1.499588e-16 35.196364
4863 all-access spider Éver_Banega es 126404 252.0 4.0 21.0 17.949592 -4.581241e-17 23.318182
4864 all-access spider Bibiana_Fernández es 126416 1232.0 4.0 23.0 54.906279 1.583451e-16 27.901818
4865 all-access spider Región_Amazónica_(Colombia) es 126430 585.0 0.0 9.0 26.069282 -2.707372e-17 10.787273
4866 all-access spider Guerra_de_los_Mil_Días es 126498 289.0 1.0 10.0 16.301199 2.321984e-16 12.998182
4867 all-access spider Día_de_la_Raza es 126528 178.0 0.0 6.0 9.102425 -4.248783e-17 7.850909
4868 all-access spider Modelo_atómico_de_Rutherford es 126539 193.0 0.0 6.0 13.056450 -1.064804e-16 8.994545
4869 all-access spider Ángulo es 126571 452.0 1.0 9.0 21.055969 -1.727612e-18 11.161818
4870 all-access spider Ana_Obregón es 126582 1202.0 24.0 75.0 64.488506 9.680844e-16 83.041818
4871 all-access spider Máximo_común_divisor es 126594 1401.0 2.0 28.0 63.599324 8.584969e-17 33.305455
4872 all-access spider Penélope_Cruz es 126718 124.0 0.0 8.0 8.441488 4.941998e-19 10.101818
4873 all-access spider Volcán es 126728 456.0 4.0 20.0 24.472039 -3.802041e-17 22.780000
4874 all-access spider Energía es 126731 635.0 3.0 20.0 35.821578 3.160608e-16 25.600000
4875 all-access spider Enrique_Peña_Nieto es 126736 151.0 0.0 8.0 13.572973 1.474030e-16 10.369091
4876 all-access spider Ciencia_ficción es 126751 1416.0 6.0 36.0 75.965807 8.512379e-16 46.212727
4877 all-access spider Beyoncé es 126787 297.0 1.0 11.0 15.232523 -7.486459e-18 13.454545
4878 all-access spider Pablo_Iglesias_Turrión es 126818 337.0 5.0 26.0 18.247882 1.365265e-16 29.045455
4879 all-access spider Lenín_Moreno es 126845 1712.0 2.0 17.0 77.752393 -2.058108e-16 21.887273
4880 all-access spider Daredevil_(serie_de_televisión) es 126852 561.0 3.0 15.0 26.178194 8.927736e-17 18.218182
4881 all-access spider Annabelle_(película) es 126859 1003.0 2.0 18.0 46.741821 -1.272840e-16 24.076364
4882 all-access spider Juegos_Olímpicos_de_Atenas_1896 es 126866 2218.0 6.0 32.0 98.226056 -3.743534e-16 40.609091
4883 all-access spider Carlos_Ruiz_Zafón es 126883 26.0 0.0 3.0 2.486077 1.416358e-17 3.681818
4884 all-access spider Martín_Lutero es 126981 1242.0 0.0 12.0 63.128327 -3.166185e-16 18.578182
4885 all-access spider Autobiografía es 126988 377.0 1.0 17.0 20.179142 -1.384523e-16 19.336364
4886 all-access spider Lidia_Valentín es 127004 247.0 0.0 5.0 13.159721 4.419833e-18 6.321818
4887 all-access spider Bimba_Bosé es 127065 468.0 1.0 17.0 22.071578 5.870042e-17 18.887273
4888 all-access spider Juan_Martín_del_Potro es 127066 55.0 0.0 3.0 3.988881 6.987626e-17 3.434545

4889 rows × 11 columns

Let’s split it up by locale and focus on the densities:

In [41]:
ggplot(par_page, aes(x='mean_view', fill='locale'))+ geom_density(position = "stack") +scale_x_log10(limits = [1,1e4]) 
C:\Users\Admin\Anaconda3\lib\site-packages\plotnine\layer.py:363: UserWarning: stat_density : Removed 49 rows containing non-finite values.
  data = self.stat.compute_layer(data, params, layout)
Out[41]:
<ggplot: (190652572592)>
In [42]:
ggplot(par_page, aes(x='max_view', fill='locale'))+ geom_density(position = "stack") +scale_x_log10(limits = [1,1e4]) 
C:\Users\Admin\Anaconda3\lib\site-packages\plotnine\layer.py:363: UserWarning: stat_density : Removed 1674 rows containing non-finite values.
  data = self.stat.compute_layer(data, params, layout)
Out[42]:
<ggplot: (190648424512)>
In [43]:
ggplot(par_page, aes(x='slope', fill='locale'))+ geom_density(position = "stack") 
Out[43]:
<ggplot: (190648907840)>
In [44]:
ggplot(par_page, aes(x='sd_view', fill='locale'))+ geom_density(position = "stack") +scale_x_log10(limits = [1,1e4]) 
C:\Users\Admin\Anaconda3\lib\site-packages\plotnine\layer.py:363: UserWarning: stat_density : Removed 145 rows containing non-finite values.
  data = self.stat.compute_layer(data, params, layout)
Out[44]:
<ggplot: (-9223371846202060400)>
In [45]:
twoDgraph= pd.DataFrame(columns=["max-mean","mean_view"])
twoDgraph["max-mean"]= np.array(joinedParam["max_view"])-np.array(joinedParam["mean_view"])
twoDgraph["mean_view"]= joinedParam["mean_view"].values

We find:

  • The Chinese pages (zh, in pink) are slightly but notably different from the rest. They have lower mean and max views and also less variation. Their slope distribution is broader, but also shifted more towards positive values compared to the other curves.

  • The peak in max views around 200-300 is most pronounced in the French pages (fr, in turquoise).

  • The English pages (en, in mustard) have the highest mean and maximum views, which is not surprising.

Next, we will examine binned 2-d histograms.

In [46]:
ggplot(twoDgraph,aes(x="max-mean",y="mean_view"))+geom_bin2d(bins = [50,50]) +scale_x_log10() +scale_y_log10() + labs(x = "maximum views above mean", y = "mean views")
Out[46]:
<ggplot: (190652706957)>

We find:

  • There is a clear correlation between mean views and maximum views. Also here we find again the two cluster peaks we had identified in the individual histograms. A couple of outliers and outlier groups are noticeable. Let’s zoom into the upper right corner (the numbers in parentheses are the row numbers):
In [47]:
twoDgraph.size
Out[47]:
8380
In [48]:
limx = [max(joinedParam["max_view"]/50), max(joinedParam["max_view"])]
limy = [max(joinedParam["mean_view"]/50), max(joinedParam["mean_view"])]
ggplot(twoDgraph,aes(x="max-mean",y="mean_view"))+ geom_point(size = 5, color = "red", alpha=0.3) +scale_x_log10(limits=limx) +scale_y_log10(limits=limy) + labs(x = "maximum views above mean", y = "mean views") 
C:\Users\Admin\Anaconda3\lib\site-packages\plotnine\layer.py:450: UserWarning: geom_point : Removed 4054 rows containing missing values.
  self.data = self.geom.handle_na(self.data)
Out[48]:
<ggplot: (190652476210)>

Here we find a number of main pages and other meta pages (in the full data set).

Another question: Does the (assumed) linear change in views depend on the total number of views?

In [49]:
ggplot(joinedParam,aes(x='slope',y='mean_view'))+geom_point(color = "red", alpha = 0.1) +scale_y_log10()+scale_x_log10() +labs(x = "linear slope relative to slope error", y = "mean views")
C:\Users\Admin\Anaconda3\lib\site-packages\plotnine\scales\scale.py:516: RuntimeWarning: invalid value encountered in log10
  return self.trans.transform(x)
C:\Users\Admin\Anaconda3\lib\site-packages\plotnine\layer.py:450: UserWarning: geom_point : Removed 1523 rows containing missing values.
  self.data = self.geom.handle_na(self.data)
Out[49]:
<ggplot: (190652476224)>

We find that articles with higher average view-count have more variability in their linear trends. However, this might be due to our slope normalisation which will decrease the effective slope for low view counts. It should not, however, affect the observation that the slopes of low-view articles are on average slightly higher than those of high-view articles. Such an effect could be caused by viewing spikes, of course, but I would expect those to be randomly distributed.
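
To make the idea of a slope measured relative to its error concrete, here is a minimal sketch. It assumes the slope is an ordinary least-squares fit of views against the day index, divided by its standard error; the notebook's own fitting code is not shown in this section, so the function name `slope_over_error` and the example data are hypothetical.

```python
import numpy as np

def slope_over_error(views):
    """Least-squares slope of views against day index, divided by its standard error."""
    y = np.asarray(views, dtype=float)
    x = np.arange(len(y), dtype=float)
    keep = ~np.isnan(y)              # skip missing days but keep the original day index
    x, y = x[keep], y[keep]
    n = len(y)
    if n < 3:
        return np.nan                # not enough points to estimate an error
    dx = x - x.mean()
    sxx = np.sum(dx ** 2)
    slope = np.sum(dx * (y - y.mean())) / sxx
    intercept = y.mean() - slope * x.mean()
    resid = y - (intercept + slope * x)
    se = np.sqrt(np.sum(resid ** 2) / (n - 2) / sxx)
    return slope / se if se > 0 else np.nan

# Made-up series: one rising, one falling
rising = [3, 5, 4, 6, 8, 7, 9, 11]
falling = rising[::-1]
print(slope_over_error(rising), slope_over_error(falling))
```

Dividing by the error is what makes slopes comparable between a page with ten views a day and one with ten thousand, and it is also why noisy low-traffic pages end up with smaller effective slopes.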

4 Individual observations with extreme parameters

Based on the overview parameters we can focus our attention on those articles for which the time series parameters are at the extremes of the parameter space.

4.1 Large linear slope

Those are the observations with the highest slope values. (In the sample this will be different, but in the full Wikipedia data set the top 10 have rownames 91728, 55587, 108341, 70772, 95367, 18357, 95229, 116150, 94975, 77292).

In [50]:
slopesort=joinedParam.sort_values(by='slope',ascending=False).head(n=5)

Let’s have a look at the time series data of the top 5 articles:

In [60]:
plot_rownr(int(slopesort['rowname'].values[0]))
Out[60]:
<ggplot: (190652601762)>
In [62]:
plot_rownr(int(slopesort['rowname'].values[1]))
Out[62]:
<ggplot: (190607411832)>
In [63]:
plot_rownr(int(slopesort['rowname'].values[2]))
Out[63]:
<ggplot: (-9223371846247377997)>
In [64]:
plot_rownr(int(slopesort['rowname'].values[3]))
Out[64]:
<ggplot: (-9223371846202655281)>
In [65]:
plot_rownr(int(slopesort['rowname'].values[4]))
Out[65]:
<ggplot: (-9223371846205868351)>

We find:

Lots of love for Twenty One Pilots in Spain. Those rapid rises and wibbly-wobbly bits are going to be difficult to predict, unless there’s a periodic modulation on top of the large-scale trend. Certainly worth figuring out.

We also see that our generic loess smoother is dealing rather well with most of the slower variability patterns and could be used to remove the low-frequency structures for further analysis.
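
Removing the low-frequency structure could look roughly like the sketch below. Note that it swaps in a centred rolling median from pandas as a simple stand-in for the loess curve, so it is not the smoother used in the plots; the helper name `detrend`, the window of 15 days, and the example series are all hypothetical.

```python
import numpy as np
import pandas as pd

def detrend(views, window=15):
    """Split a series into a smooth trend and a residual (hypothetical helper)."""
    s = pd.Series(views, dtype=float)
    # Centred rolling median: a simple stand-in for loess, not the same smoother
    trend = s.rolling(window, center=True, min_periods=1).median()
    return trend, s - trend

days = np.arange(120)
# Made-up series: slow linear rise plus a weekly modulation
views = 100 + 0.5 * days + 10 * np.sin(2 * np.pi * days / 7)
trend, resid = detrend(views)
print(resid.describe())
```

The residual keeps the short-period modulation while the slow rise ends up in `trend`, which is the separation one would want before looking for periodic signals.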

Let’s compare the interest in Twenty One Pilots for the different countries, to see whether a prediction for one of them could learn from the others:

In [66]:
plot_names_nrm("Twenty_One_Pilots", "all-access", "all-agents")
Out[66]:
<ggplot: (-9223371846202510823)>

Note that those curves are each normalised to their own mean views and have a logarithmic y-axis to mitigate the effect of large spikes. This chart is for relative trend comparison.
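
The normalisation described here can be sketched as follows. This is not the body of `plot_names_nrm`, only an illustration of the two steps (divide by the mean, then take the logarithm) on made-up numbers; the function name `normalise_to_mean` is hypothetical.

```python
import numpy as np
import pandas as pd

def normalise_to_mean(wide):
    """Divide every column (one series per locale) by its own mean."""
    return wide / wide.mean(axis=0)

# Made-up daily views for two locales with very different traffic levels
wide = pd.DataFrame({
    "en": [1000.0, 1200.0, 800.0, 1000.0],
    "es": [10.0, 30.0, 10.0, 30.0],
})
norm = normalise_to_mean(wide)
# log10 compresses large spikes; zeros would become -inf, so mask them first
log_norm = np.log10(norm.where(norm > 0))
print(norm)
```

After this step every locale averages 1, so the curves can share one axis even though their raw view counts differ by orders of magnitude.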

We find:

  • Germany and France show quite similar viewing behaviour, while Russia and Spain are comparable too; especially in the early rise in interest. The English pages show less dramatic changes but end up

  • With a purely time-series-forecast approach I think that the large spikes are close to impossible to predict. However, external data could help a lot here.

  • Those viewing numbers were going up, but which articles were going down? (Top 10: 95856, 74115, 8388, 103659, 100213, 9633, 102481, 38458, 30042, 74002)

In [67]:
#Article going down
articleAscending=joinedParam.sort_values(by='slope',ascending=True).head(n=5)
articleAscending
Out[67]:
max_view min_view med_view sd_view slope mean_view rowname
rowname
101457.0 638556.0 12119.0 22068.5 88932.764827 -1.512772e-12 39924.494545 101457.0
42059.0 1728694.0 274.0 544.5 169174.680531 -1.246962e-12 52019.338182 42059.0
99537.0 1412292.0 85970.0 181508.5 77589.183380 -9.844988e-13 188662.325455 99537.0
13053.0 611564.0 3746.0 12335.0 58739.412784 -9.217630e-13 28224.632727 13053.0
9632.0 38175.0 929.0 4632.5 11589.032011 -7.583681e-13 11632.840000 9632.0
In [69]:
plot_rownr(int(articleAscending['rowname'].values[0]))
Out[69]:
<ggplot: (190652209363)>
In [70]:
plot_rownr(int(articleAscending['rowname'].values[1]))
Out[70]:
<ggplot: (190652114559)>
In [71]:
plot_rownr(int(articleAscending['rowname'].values[2]))
Out[71]:
<ggplot: (190652202493)>
In [72]:
plot_rownr(int(articleAscending['rowname'].values[3]))
Out[72]:
<ggplot: (-9223371846202419540)>

The main page itself on mobile, and review articles on 2015 were the biggest losers.

4.2 High standard deviations

The top 10 Wikipedia rows are 9775, 38574, 103124, 99323, 74115, 39181, 10404, 33645, 34258, and 26994. Bingo, anyone?

In [73]:
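# 4.2 High standard deviations
# Keep the sd/mean ratio for the normalised comparison further down
joinedParam['sd_div_mean']= joinedParam['sd_view']/joinedParam['mean_view']
# Rank by the raw standard deviation, largest first, to match the section title
sddivsort=joinedParam.sort_values(by='sd_view',ascending=False).head(n=5)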
In [74]:
#plot 4 graphs
plot_rownr(int(sddivsort['rowname'].values[0]))
Out[74]:
<ggplot: (-9223371846202614966)>
In [75]:
plot_rownr(int(sddivsort['rowname'].values[1]))
Out[75]:
<ggplot: (190652361683)>
In [76]:
plot_rownr(int(sddivsort['rowname'].values[2]))
Out[76]:
<ggplot: (-9223371846202264106)>
In [77]:
plot_rownr(int(sddivsort['rowname'].values[3]))
Out[77]:
<ggplot: (-9223371846159100009)>

Those are pretty strong spikes in the main page views, even if the baseline is around 1-10 million to begin with. They look consistent though over different languages. Any ideas what could cause this?

If we normalise standard deviation by mean we get a different set of results:

In [78]:
plot_rownr(10032)
Out[78]:
<ggplot: (190691117902)>
In [79]:
plot_rownr(38812)
Out[79]:
<ggplot: (190652327081)>
In [80]:
plot_rownr(86905)
Out[80]:
<ggplot: (190651941524)>
In [81]:
plot_rownr(102521)
Out[81]:
<ggplot: (-9223371846202371647)>

Those are very, very suspicious. They are essentially low baselines with single dates that have way higher view counts (e.g. around 20 vs 2 million for the upper left one). These have to be errors in the data which can be dangerous for predictions if they appear close to either end of the date window. In other cases, most smoothing methods should be able to deal with them.
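
One way to flag such isolated spikes before fitting a model is sketched below. It scores each day by its distance from the series median in units of the median absolute deviation, which a single extreme value barely affects. The helper name `flag_spikes`, the threshold of 20, and the example values are assumptions for illustration, not something taken from the notebook.

```python
import numpy as np

def flag_spikes(views, threshold=20.0):
    """Return a boolean mask of days far above the series median (hypothetical helper)."""
    y = np.asarray(views, dtype=float)
    med = np.nanmedian(y)
    # Median absolute deviation: a spread estimate that single spikes barely affect
    mad = np.nanmedian(np.abs(y - med))
    scale = mad if mad > 0 else 1.0   # avoid dividing by zero on flat series
    return (y - med) / scale > threshold

# Made-up series: a baseline near 20 views with one day in the millions
views = np.array([18, 22, 20, 19, 2_000_000, 21, 20, 23], dtype=float)
mask = flag_spikes(views)
cleaned = np.where(mask, np.nan, views)  # treat flagged days as missing
print(np.flatnonzero(mask))
```

Treating flagged days as missing rather than deleting them keeps the date index intact, which matters most when the spike sits near either end of the window.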

4.3 Large variability amplitudes

The top amplitudes are the same as the top standard deviations, due to the spiky nature of the variability:

In [82]:
#Large variability amplitudes
joinedParam['maxView_meanView']= joinedParam['max_view']-joinedParam['mean_view']
maxView_meanViewSort=joinedParam.sort_values(by='maxView_meanView',ascending=False).head(n=5)
In [83]:
plot_rownr(int(maxView_meanViewSort['rowname'].values[0]))
Out[83]:
<ggplot: (-9223371846248467368)>
In [84]:
plot_rownr(int(maxView_meanViewSort['rowname'].values[1]))
Out[84]:
<ggplot: (190652282481)>
In [85]:
plot_rownr(int(maxView_meanViewSort['rowname'].values[2]))
Out[85]:
<ggplot: (190652285623)>
In [86]:
plot_rownr(int(maxView_meanViewSort['rowname'].values[3]))
Out[86]:
<ggplot: (190652232368)>

4.4 High average views

Those are the time series of the most popular pages, which we already identified as the main pages in the plots above:

In [87]:
meandescSort=joinedParam.sort_values(by='mean_view',ascending=False).head(n=5)
In [88]:
plot_rownr(int(meandescSort['rowname'].values[0]))
Out[88]:
<ggplot: (190652350805)>
In [89]:
plot_rownr(int(meandescSort['rowname'].values[1]))
Out[89]:
<ggplot: (190648438925)>
In [90]:
plot_rownr(int(meandescSort['rowname'].values[2]))
Out[90]:
<ggplot: (190695211128)>
In [91]:
plot_rownr(int(meandescSort['rowname'].values[3]))
Out[91]:
<ggplot: (-9223371846206140223)>

In addition to the spikes on the English main page there is a surprising amount of variability, as exemplified by the long-term structure in the German main page.

What about other main pages, as identified in the zoom-in above?

In [93]:
plot_rownr_log(int(meandescSort['rowname'].values[0]))
Out[93]:
<ggplot: (190652373196)>
In [94]:
plot_rownr_log(int(meandescSort['rowname'].values[1]))
Out[94]:
<ggplot: (190652601913)>
In [95]:
plot_rownr_log(int(meandescSort['rowname'].values[2]))
Out[95]:
<ggplot: (190652592311)>
In [96]:
plot_rownr_log(int(meandescSort['rowname'].values[3]))
Out[96]:
<ggplot: (-9223371846205853973)>

Here 3 of the 4 plots have a logarithmic y-axis to improve the clarity of visualising the time series with strong spikes. We see that those popular pages also exhibit strong variability on various time scales.

In summary: We have identified the time series with the highest variability according to basic criteria. We also found a few time series with bogus values. These are the data sets that might pose the greatest challenge to our prediction algorithms.

5 Short-term variability

Before turning to forecasting methods, let’s have a closer look at the characteristic short-term variability that has become evident in several of the plots already. Below, we plot a two-month zoom into the “quiet” parts (i.e. no strong spikes) of different time series:

In [97]:
plot_rownr_zoom(10404, "2016-10-01", "2016-12-01")
Out[97]:
<ggplot: (-9223371846202300486)>
In [98]:
plot_rownr_zoom(9775, "2015-09-01", "2015-11-01")
Out[98]:
<ggplot: (-9223371846202192489)>
In [99]:
plot_rownr_zoom(139120, "2016-10-01", "2016-12-01")
Out[99]:
<ggplot: (190652110172)>
In [100]:
plot_rownr_zoom(110658, "2016-07-01", "2016-09-01")
Out[100]:
<ggplot: (-9223371846158471392)>

We see that the high-view-count time series on the left-hand side show a very regular periodicity that is strikingly similar for both of them. A similar structure can be seen on the right-hand side, although here it is partly distorted by a slight upward trend (upper right) and/or variance caused by lower viewing numbers (lower right).

These plots provide evidence that there is variability on a weekly scale. The next figure will visualise this weekly behaviour in a different way:

In [101]:
def wday_profile(rownr, start, end):
    # Average views per weekday inside [start, end], relative to the mean over all weekdays
    ts = extract_ts_nrm(rownr).copy()  # copy, so that we never write into a slice
    ts["dates"] = pd.to_datetime(ts["dates"])
    ts = ts[(ts["dates"] >= pd.Timestamp(start)) & (ts["dates"] <= pd.Timestamp(end))]
    ts = ts.assign(wday=ts["dates"].dt.dayofweek)  # 0 = Monday ... 6 = Sunday
    prof = ts.groupby("wday", as_index=False)["views"].mean()
    prof = prof.rename(columns={"views": "wday_views"})
    prof["wday_views"] = prof["wday_views"] / prof["wday_views"].mean()
    prof["rowname"] = str(rownr)  # label each profile with its row number
    return prof

dateView1 = wday_profile(10404, "2016-10-01", "2016-12-01")
dateView2 = wday_profile(9775, "2015-09-01", "2015-11-01")
dateView3 = wday_profile(110658, "2016-07-01", "2016-09-01")

m = pd.concat([dateView1, dateView2, dateView3])
(ggplot(m, aes("wday", "wday_views", color="rowname"))
 + geom_jitter(size=4, width=0.1)
 + labs(x="Day of the week", y="Relative average views"))
Out[101]:
<ggplot: (-9223371846158471364)>

In this plot, a period of 7 days is indicated by the vertical blue line.

As expected, we find that all four data sets share a strong signal at a period of 1 week. This is particularly evident in the clean time series with high median views, but it is still visible in the article with lower views. In our following analysis we can therefore reasonably assume that a period of 7 days is present in all our articles.
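
One way to check for such a period numerically is a periodogram. The following is a minimal sketch on a synthetic series (a constant level plus a 7-day sine wave and some noise, all of which are assumptions and not competition data); the helper name dominant_period is hypothetical.

```python
import numpy as np

# Synthetic daily series (an assumption, not competition data):
# a constant level, a 7-day cycle, and a little noise.
rng = np.random.RandomState(0)
n_days = 140                      # 20 full weeks
t = np.arange(n_days)
views = 100.0 + 10.0 * np.sin(2.0 * np.pi * t / 7.0) + rng.normal(0.0, 2.0, n_days)

def dominant_period(series):
    """Return the period (in samples) with the highest periodogram power."""
    centred = np.asarray(series, dtype=float) - np.mean(series)
    power = np.abs(np.fft.rfft(centred)) ** 2
    freqs = np.fft.rfftfreq(len(centred), d=1.0)   # cycles per day
    peak = np.argmax(power[1:]) + 1                # skip the zero frequency
    return 1.0 / freqs[peak]

period = dominant_period(views)
print(round(period, 2))
```

For real page views the peak will be less clean, and a strong trend should be removed first, since it would otherwise dominate the low frequencies.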

6 Forecast methods for selected examples

Now that we have identified a sample of time series with extreme parameters we can use them to test different forecasting methods. For a sample of 145k articles we will most likely have to rely on an automatic mechanism to make our predictions (although a degree of fine-tuning might be possible). Therefore, our forecasting method will have to perform robustly for a range of different time series shapes and variabilities. Those methods that manage to deal with our extreme examples should be able to deal with any less variable time series as well.

For this competition our forecast period is 2 months, i.e. about 60 days. In the following, we simulate this period and assess our prediction accuracy by keeping a hold-out sample of the last 60 days from our forecasting data. After making the prediction we can compare the actual view counts to the forecasted ones.
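
The hold-out idea can be sketched in a few lines. The example below splits off the last 60 points and scores a forecast with the symmetric mean absolute percentage error (SMAPE), which is chosen here only as one possible accuracy measure. The helper names split_holdout and smape, the constant test series, and the naive forecast are all hypothetical and only serve the illustration.

```python
import numpy as np

def split_holdout(values, holdout=60):
    """Split a series into a training part and the last `holdout` points."""
    values = np.asarray(values, dtype=float)
    return values[:-holdout], values[-holdout:]

def smape(actual, forecast):
    """Symmetric mean absolute percentage error in percent (0 is a perfect forecast)."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    denom = np.abs(actual) + np.abs(forecast)
    # Points where both values are zero count as a perfect match
    safe = np.where(denom == 0, 1.0, denom)
    ratio = np.where(denom == 0, 0.0, 2.0 * np.abs(forecast - actual) / safe)
    return 100.0 * np.mean(ratio)

# Hypothetical series: 200 days at a constant 100 views
series = np.full(200, 100.0)
train_part, test_part = split_holdout(series, holdout=60)

# A naive forecast that repeats the last training value
naive_fc = np.full(len(test_part), train_part[-1])
score = smape(test_part, naive_fc)
print(len(train_part), len(test_part), score)
```

Any of the forecasting methods below could be scored in the same way by replacing naive_fc with the model's 60-day prediction.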

6.1 ARIMA / auto.arima

A popular approach in time series forecasting is to use an autoregressive integrated moving average model, or ARIMA model for short. This kind of model consists of three parts, parametrised by the indices p, d, q as ARIMA(p, d, q):

  • auto-regressive / p: we are using past data to compute a regression model for future data. The parameter p indicates the range of lags; e.g. ARIMA(3,0,0) includes t-1, t-2, and t-3 values in the regression to compute the value at t.

  • integrated / d: this is a differencing parameter, which gives us the number of times we subtract the previous value of a time series from the current one. Differencing stabilises the mean by removing trends (and, in its seasonal form, seasonal patterns). This is necessary since a regression on the lags (e.g. on the value at time t-1) is most meaningful if large-scale trends are removed. A time series whose mean, variance (or amount of variability), and autocovariance are time-invariant (i.e. don’t change from day to day) is called stationary.

  • moving average / q: this parameter gives us the number of previous error terms to include in the regression error of the model.
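
The effect of the differencing parameter can be illustrated with a small sketch. The series below is made up (a linear trend plus a repeating 7-day pattern, both assumptions): one ordinary difference turns the linear trend into a constant, and a lag-7 difference removes the weekly pattern.

```python
import numpy as np

# Made-up example data: 4 weeks of daily values
t = np.arange(28)
trend_only = 50.0 + 2.0 * t                       # linear trend, slope 2 per day
weekly = np.tile([5.0, 3.0, 0.0, -2.0, -3.0, -2.0, -1.0], 4)
series = trend_only + weekly

# d = 1: subtract the previous value from the current one
first_diff = np.diff(trend_only)                  # constant 2.0, the trend is gone

# Seasonal differencing at lag 7: subtract the value one week earlier
seasonal_diff = series[7:] - series[:-7]          # weekly pattern cancels, 7 * 2.0 remains

# Applying both leaves nothing but zeros for this noise-free series
both = np.diff(seasonal_diff)

print(first_diff[:3], seasonal_diff[:3], both[:3])
```

In an ARIMA fit these two steps correspond to the non-seasonal order d and the seasonal order D, with the lag given by the seasonal period.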

Using our insight about the weekly periodicity, we can incorporate a seasonal period directly when fitting the model: auto_arima takes it through its m argument, where m=7 would describe a weekly cycle in daily data (the runs recorded below used m=12). Note that the cleaning and outlier rejection that R's tsclean tool provides is not performed in the code below. As usual, we wrap the modelling and plotting process into a function and then apply it to four time series sets that we know from our previous analysis:

In [103]:
def plot_auto_arima_rownr(rownr):
    pageviews_arima = extract_ts(rownr)
    pred_len = 60  # hold out the last 60 days
    pre_views_arima = pageviews_arima.head(pageviews_arima.shape[0] - pred_len)
    post_views_arima = pageviews_arima.tail(pred_len)
    # m is the seasonal period: m=12 produced the trace below,
    # while the weekly cycle found in section 5 corresponds to m=7
    stepwise_fit = auto_arima(pre_views_arima["views"], start_p=1, start_q=1,
                              max_p=3, max_q=3, m=12,
                              start_P=0,
                              d=1, D=1, trace=True,
                              error_action='ignore',   # don't want to know if an order does not work
                              suppress_warnings=True)  # stepwise search is the default

    fc_views = np.asarray(stepwise_fit.predict(n_periods=pred_len))
    # x positions: the training days first, then the 60-day forecast horizon
    n_pre = pre_views_arima.shape[0]
    x_pre = np.arange(n_pre)
    x_post = np.arange(n_pre, n_pre + pred_len)
    plt.plot(x_pre, pre_views_arima["views"].values)
    plt.plot(x_post, post_views_arima["views"].values, c="grey")  # hold-out sample
    plt.plot(x_post, fc_views, c="red")                           # forecast
In [104]:
plot_auto_arima_rownr(1)
Fit ARIMA: order=(1, 1, 1) seasonal_order=(0, 1, 1, 12); AIC=4378.101, BIC=4398.938, Fit time=4.686 seconds
Fit ARIMA: order=(0, 1, 0) seasonal_order=(0, 1, 0, 12); AIC=4886.870, BIC=4895.205, Fit time=0.125 seconds
Fit ARIMA: order=(1, 1, 0) seasonal_order=(1, 1, 0, 12); AIC=4649.461, BIC=4666.131, Fit time=1.546 seconds
Fit ARIMA: order=(0, 1, 1) seasonal_order=(0, 1, 1, 12); AIC=4399.901, BIC=4416.571, Fit time=1.489 seconds
Fit ARIMA: order=(1, 1, 1) seasonal_order=(1, 1, 1, 12); AIC=4380.169, BIC=4405.174, Fit time=3.078 seconds
Fit ARIMA: order=(1, 1, 1) seasonal_order=(0, 1, 0, 12); AIC=4674.412, BIC=4691.082, Fit time=1.653 seconds
Fit ARIMA: order=(1, 1, 1) seasonal_order=(0, 1, 2, 12); AIC=4379.038, BIC=4404.043, Fit time=14.612 seconds
Fit ARIMA: order=(1, 1, 1) seasonal_order=(1, 1, 2, 12); AIC=4380.270, BIC=4409.443, Fit time=15.141 seconds
Fit ARIMA: order=(2, 1, 1) seasonal_order=(0, 1, 1, 12); AIC=4376.369, BIC=4401.375, Fit time=5.177 seconds
Fit ARIMA: order=(2, 1, 0) seasonal_order=(0, 1, 1, 12); AIC=4420.162, BIC=4440.999, Fit time=1.325 seconds
Fit ARIMA: order=(2, 1, 2) seasonal_order=(0, 1, 1, 12); AIC=4380.447, BIC=4409.619, Fit time=3.626 seconds
Fit ARIMA: order=(1, 1, 0) seasonal_order=(0, 1, 1, 12); AIC=4498.087, BIC=4514.757, Fit time=0.923 seconds
Fit ARIMA: order=(3, 1, 2) seasonal_order=(0, 1, 1, 12); AIC=4367.765, BIC=4401.105, Fit time=7.745 seconds
Fit ARIMA: order=(3, 1, 2) seasonal_order=(1, 1, 1, 12); AIC=4371.302, BIC=4408.810, Fit time=8.642 seconds
Fit ARIMA: order=(3, 1, 2) seasonal_order=(0, 1, 0, 12); AIC=4644.915, BIC=4674.088, Fit time=1.313 seconds
Fit ARIMA: order=(3, 1, 2) seasonal_order=(0, 1, 2, 12); AIC=4380.351, BIC=4417.859, Fit time=20.232 seconds
Fit ARIMA: order=(3, 1, 2) seasonal_order=(1, 1, 2, 12); AIC=4371.515, BIC=4413.190, Fit time=21.892 seconds
Fit ARIMA: order=(3, 1, 1) seasonal_order=(0, 1, 1, 12); AIC=4361.212, BIC=4390.384, Fit time=4.706 seconds
Fit ARIMA: order=(3, 1, 1) seasonal_order=(1, 1, 1, 12); AIC=4362.980, BIC=4396.320, Fit time=6.168 seconds
Fit ARIMA: order=(3, 1, 1) seasonal_order=(0, 1, 0, 12); AIC=4654.001, BIC=4679.006, Fit time=1.703 seconds
Fit ARIMA: order=(3, 1, 1) seasonal_order=(0, 1, 2, 12); AIC=4362.598, BIC=4395.938, Fit time=18.425 seconds
Fit ARIMA: order=(3, 1, 1) seasonal_order=(1, 1, 2, 12); AIC=4363.614, BIC=4401.122, Fit time=21.325 seconds
Fit ARIMA: order=(3, 1, 0) seasonal_order=(0, 1, 1, 12); AIC=4408.606, BIC=4433.611, Fit time=1.735 seconds
Total fit time: 167.307 seconds
In [108]:
plot_auto_arima_rownr(95856)
Fit ARIMA: order=(1, 1, 1) seasonal_order=(0, 1, 1, 12); AIC=nan, BIC=nan, Fit time=nan seconds
Fit ARIMA: order=(0, 1, 0) seasonal_order=(0, 1, 0, 12); AIC=7679.407, BIC=7687.742, Fit time=0.078 seconds
Fit ARIMA: order=(1, 1, 0) seasonal_order=(1, 1, 0, 12); AIC=7534.601, BIC=7551.272, Fit time=1.500 seconds
Fit ARIMA: order=(0, 1, 1) seasonal_order=(0, 1, 1, 12); AIC=7287.148, BIC=7303.818, Fit time=3.814 seconds
Fit ARIMA: order=(0, 1, 1) seasonal_order=(1, 1, 1, 12); AIC=7277.707, BIC=7298.545, Fit time=3.952 seconds
Fit ARIMA: order=(0, 1, 1) seasonal_order=(1, 1, 0, 12); AIC=7523.374, BIC=7540.044, Fit time=2.140 seconds
Fit ARIMA: order=(0, 1, 1) seasonal_order=(1, 1, 2, 12); AIC=7291.141, BIC=7316.146, Fit time=11.486 seconds
Fit ARIMA: order=(0, 1, 1) seasonal_order=(0, 1, 0, 12); AIC=7639.158, BIC=7651.661, Fit time=0.685 seconds
Fit ARIMA: order=(0, 1, 1) seasonal_order=(2, 1, 2, 12); AIC=7283.071, BIC=7312.244, Fit time=16.384 seconds
Fit ARIMA: order=(1, 1, 1) seasonal_order=(1, 1, 1, 12); AIC=nan, BIC=nan, Fit time=nan seconds
Fit ARIMA: order=(0, 1, 0) seasonal_order=(1, 1, 1, 12); AIC=7333.663, BIC=7350.333, Fit time=1.716 seconds
Fit ARIMA: order=(0, 1, 2) seasonal_order=(1, 1, 1, 12); AIC=7230.935, BIC=7255.940, Fit time=5.219 seconds
Fit ARIMA: order=(1, 1, 3) seasonal_order=(1, 1, 1, 12); AIC=7235.927, BIC=7269.267, Fit time=7.012 seconds
Fit ARIMA: order=(0, 1, 2) seasonal_order=(0, 1, 1, 12); AIC=7235.567, BIC=7256.405, Fit time=4.604 seconds
Fit ARIMA: order=(0, 1, 2) seasonal_order=(2, 1, 1, 12); AIC=7215.683, BIC=7244.856, Fit time=11.275 seconds
Fit ARIMA: order=(0, 1, 2) seasonal_order=(2, 1, 0, 12); AIC=7340.275, BIC=7365.280, Fit time=10.473 seconds
Fit ARIMA: order=(0, 1, 2) seasonal_order=(2, 1, 2, 12); AIC=7217.599, BIC=7250.939, Fit time=19.502 seconds
Fit ARIMA: order=(0, 1, 2) seasonal_order=(1, 1, 0, 12); AIC=7453.948, BIC=7474.786, Fit time=2.552 seconds
Fit ARIMA: order=(1, 1, 2) seasonal_order=(2, 1, 1, 12); AIC=nan, BIC=nan, Fit time=nan seconds
Fit ARIMA: order=(0, 1, 1) seasonal_order=(2, 1, 1, 12); AIC=7254.206, BIC=7279.211, Fit time=13.218 seconds
Fit ARIMA: order=(0, 1, 3) seasonal_order=(2, 1, 1, 12); AIC=7214.907, BIC=7248.247, Fit time=14.829 seconds
Fit ARIMA: order=(0, 1, 3) seasonal_order=(1, 1, 1, 12); AIC=7225.086, BIC=7254.258, Fit time=6.920 seconds
Fit ARIMA: order=(0, 1, 3) seasonal_order=(2, 1, 0, 12); AIC=7335.340, BIC=7364.513, Fit time=12.831 seconds
Fit ARIMA: order=(0, 1, 3) seasonal_order=(2, 1, 2, 12); AIC=7275.150, BIC=7312.657, Fit time=19.157 seconds
Fit ARIMA: order=(0, 1, 3) seasonal_order=(1, 1, 0, 12); AIC=7436.326, BIC=7461.331, Fit time=4.533 seconds
Fit ARIMA: order=(1, 1, 3) seasonal_order=(2, 1, 1, 12); AIC=7216.287, BIC=7253.794, Fit time=19.296 seconds
Total fit time: 193.270 seconds
In [109]:
plot_auto_arima_rownr(108341)
Fit ARIMA: order=(1, 1, 1) seasonal_order=(0, 1, 1, 12); AIC=7765.392, BIC=7786.230, Fit time=4.786 seconds
Fit ARIMA: order=(0, 1, 0) seasonal_order=(0, 1, 0, 12); AIC=8153.392, BIC=8161.727, Fit time=0.109 seconds
Fit ARIMA: order=(1, 1, 0) seasonal_order=(1, 1, 0, 12); AIC=7996.397, BIC=8013.068, Fit time=1.407 seconds
Fit ARIMA: order=(0, 1, 1) seasonal_order=(0, 1, 1, 12); AIC=7786.285, BIC=7802.955, Fit time=2.166 seconds
Fit ARIMA: order=(1, 1, 1) seasonal_order=(1, 1, 1, 12); AIC=7766.787, BIC=7791.792, Fit time=5.540 seconds
Fit ARIMA: order=(1, 1, 1) seasonal_order=(0, 1, 0, 12); AIC=8053.552, BIC=8070.222, Fit time=0.970 seconds
Fit ARIMA: order=(1, 1, 1) seasonal_order=(0, 1, 2, 12); AIC=nan, BIC=nan, Fit time=nan seconds
Fit ARIMA: order=(1, 1, 1) seasonal_order=(1, 1, 2, 12); AIC=nan, BIC=nan, Fit time=nan seconds
Fit ARIMA: order=(2, 1, 1) seasonal_order=(0, 1, 1, 12); AIC=7732.569, BIC=7757.574, Fit time=6.054 seconds
Fit ARIMA: order=(2, 1, 0) seasonal_order=(0, 1, 1, 12); AIC=7735.643, BIC=7756.480, Fit time=3.258 seconds
Fit ARIMA: order=(2, 1, 2) seasonal_order=(0, 1, 1, 12); AIC=7679.053, BIC=7708.226, Fit time=5.256 seconds
Fit ARIMA: order=(3, 1, 3) seasonal_order=(0, 1, 1, 12); AIC=7755.533, BIC=7793.041, Fit time=8.078 seconds
Fit ARIMA: order=(2, 1, 2) seasonal_order=(1, 1, 1, 12); AIC=7681.727, BIC=7715.067, Fit time=6.735 seconds
Fit ARIMA: order=(2, 1, 2) seasonal_order=(0, 1, 0, 12); AIC=7970.309, BIC=7995.314, Fit time=1.548 seconds
Fit ARIMA: order=(2, 1, 2) seasonal_order=(0, 1, 2, 12); AIC=nan, BIC=nan, Fit time=nan seconds
Fit ARIMA: order=(2, 1, 2) seasonal_order=(1, 1, 2, 12); AIC=nan, BIC=nan, Fit time=nan seconds
Fit ARIMA: order=(1, 1, 2) seasonal_order=(0, 1, 1, 12); AIC=7677.368, BIC=7702.373, Fit time=5.158 seconds
Fit ARIMA: order=(1, 1, 3) seasonal_order=(0, 1, 1, 12); AIC=7679.056, BIC=7708.229, Fit time=6.466 seconds
Fit ARIMA: order=(2, 1, 3) seasonal_order=(0, 1, 1, 12); AIC=7681.878, BIC=7715.218, Fit time=7.201 seconds
Fit ARIMA: order=(1, 1, 2) seasonal_order=(1, 1, 1, 12); AIC=7678.943, BIC=7708.116, Fit time=6.125 seconds
Fit ARIMA: order=(1, 1, 2) seasonal_order=(0, 1, 0, 12); AIC=7969.155, BIC=7989.993, Fit time=0.908 seconds
Fit ARIMA: order=(1, 1, 2) seasonal_order=(0, 1, 2, 12); AIC=nan, BIC=nan, Fit time=nan seconds
Fit ARIMA: order=(1, 1, 2) seasonal_order=(1, 1, 2, 12); AIC=nan, BIC=nan, Fit time=nan seconds
Fit ARIMA: order=(0, 1, 2) seasonal_order=(0, 1, 1, 12); AIC=7730.764, BIC=7751.601, Fit time=2.922 seconds
Total fit time: 74.753 seconds

The results are not too bad, actually, especially in the lower left plot. We even got a downturn in the upper left plot. The upper right plot is a challenging problem, because the levelling of the viewer numbers at the end of the time range was not predictable from the previous behaviour. The same is true for the large spike in the lower right plot.

Given that it’s a fully automatic forecast (assuming only weekly periodicities) the auto_arima tool performs decently and provides us with a useful baseline to compare other methods to.

6.2 Prophet - Section currently under maintenance

Prophet is an open-source time series forecasting tool developed by Facebook. It is implemented in an R library, and also a Python package (as already shown in this competition).

Prophet works as an additive regression model which decomposes a time series into (i) a (piecewise) linear/logistic trend, (ii) a yearly seasonal component, (iii) a weekly seasonal component, and (iv) an optional list of important days (such as holidays, special events, …). It claims to be “robust to missing data, shifts in the trend, and large outliers”, which would make it well suited for this particular task.
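
To make the idea of an additive model concrete, here is a minimal sketch that fits a straight-line trend plus a weekly effect by ordinary least squares. This is not how prophet itself estimates its components; it is a simplified stand-in on made-up, noise-free data, and all names in it are hypothetical.

```python
import numpy as np

# Made-up noise-free series: level + linear trend + weekly effect
n_days = 70
t = np.arange(n_days)
wday = t % 7
true_weekly = np.array([3.0, 2.0, 1.0, 0.0, -1.0, -2.0, -3.0])   # sums to zero
y = 20.0 + 0.5 * t + true_weekly[wday]

# Design matrix: intercept, time, and one dummy per weekday except weekday 0
dummies = (wday[:, None] == np.arange(1, 7)[None, :]).astype(float)
X = np.column_stack([np.ones(n_days), t, dummies])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

slope = coef[1]                                 # estimated trend per day
weekly = np.concatenate([[0.0], coef[2:]])      # effects relative to weekday 0
weekly = weekly - weekly.mean()                 # centre so the effects sum to zero

print(round(slope, 3), np.round(weekly, 3))
```

Because the components simply add up, each one can be inspected on its own, which is what the decomposition plot further below relies on.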

6.2.1 Basic performance

First, let’s test the tool:

In [110]:
rownr=139120
pageviews= extract_ts(rownr)
pageviews["dates"] = pd.to_datetime(pageviews['dates'])
# prophet expects the columns to be named ds (dates) and y (values)
pageviews.rename(columns={'dates':'ds',
                          'views':'y'}, inplace=True)
pred_len=60
# training part and hold-out sample (the last 60 days)
pre_views=pageviews.head(pageviews.shape[0]-pred_len)
post_views= pageviews.tail(pred_len)
In [111]:
from fbprophet import Prophet

A few notes about the practical workings of prophet:

  • data format: prophet expects a data frame with two columns: ds, y. The first one holds the dates, the second one the time series counts.

  • parameter changepoint_prior_scale adjusts the trend flexibility. Increasing this parameter makes the fit more flexible, but also increases the forecast uncertainties and makes it more likely to overfit to noise. The changepoints in the data are automatically detected unless they are specified by hand using the changepoints argument (which we don’t do here).

  • parameter yearly_seasonality=True has to be enabled explicitly and allows prophet to notice large-scale cycles. The importance of this parameter is explored further below.
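
The data format from the first note can be sketched with pandas alone. The tiny table below is a made-up stand-in for the wide training data (one row per page, one column per day), and the helper name to_prophet_frame is hypothetical.

```python
import pandas as pd

# Made-up stand-in for the wide training table: one row per page, one column per day
wide = pd.DataFrame({
    "Page": ["Example_en.wikipedia.org_all-access_spider"],
    "2015-07-01": [18.0],
    "2015-07-02": [11.0],
    "2015-07-03": [5.0],
})

def to_prophet_frame(wide_df, rownr):
    """Turn one row of the wide table into a long frame with columns ds and y."""
    row = wide_df.drop(columns="Page").iloc[rownr]
    return pd.DataFrame({
        "ds": pd.to_datetime(row.index),    # column names become dates
        "y": row.values.astype(float),      # view counts
    })

prophet_df = to_prophet_frame(wide, 0)
print(prophet_df)
```

A frame of this shape is what the fit call expects; any extra columns would have to be dropped or renamed first.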

This is the standard prophet forecast plot:

In [112]:
m = Prophet()
m.fit(pre_views)  # fit on the training part only, so the hold-out sample stays unseen
future = m.make_future_dataframe(periods=pred_len)
forecast = m.predict(future)
INFO:fbprophet.forecaster:Disabling yearly seasonality. Run prophet with yearly_seasonality=True to override this.
INFO:fbprophet.forecaster:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
In [113]:
m.plot(forecast)
Out[113]:

The observed data are plotted as black points and the fitted model, plus forecast, as a blue line. In light blue we see the corresponding uncertainties.

Prophet offers a decomposition plot, where we can inspect the additive components of the model: trend, yearly seasonality, and weekly cycles:

In [114]:
fig2 = m.plot_components(forecast)

We see that prophet recovers the weekly variation pattern we had extracted by hand in the previous section. This is a useful consistency check. The seasonal variability suggests an overall decline in views towards the middle of the year.

Being the ggplot2 freaks that we are, we decide to visualise our forecast in a different way that gives us more control over the output:

In [115]:
(ggplot(forecast, aes("ds", "yhat"))
 + geom_ribbon(aes(x="ds", ymin="yhat_lower", ymax="yhat_upper"), fill="lightblue")
 + geom_line(colour="blue")  # forecast line; "#ADD8E6" is the ribbon colour and would hide it
 + geom_line(pre_views, aes("ds", "y"), colour="black")
 + geom_line(post_views, aes("ds", "y"), colour="grey"))
Out[115]:
<ggplot: (190652677468)>

Here we plot the observed data as black line, our hold-out set as a grey line, and the forecast plus uncertainties in blue and light blue, again. This shows us immediately how our model is performing, and in this case it’s not doing badly.

We turn this ggplot2 version into a plotting function and use it to predict a couple of sample time series. We also include the seasonality parameter (True/False) as a second input:

In [116]:
def plot_prophet_rownr_season(rownr, season):
    # article name, locale and access type are used for the plot title
    art=tpages['article'][rownr]
    loc=tpages['locale'][rownr]
    acc=tpages['access'][rownr]
    pageviews= extract_ts(rownr)
    pageviews["dates"] = pd.to_datetime(pageviews['dates'])
    # prophet requires the lower-case column names ds and y
    pageviews.rename(columns={'dates':'ds',
                          'views':'y'}, inplace=True)
    pred_len=60
    pre_views=pageviews.head(pageviews.shape[0]-pred_len)
    post_views= pageviews.tail(pred_len)
    m = Prophet(changepoint_prior_scale=0.5, yearly_seasonality=season)
    m.fit(pre_views)  # fit on the training part only; the last 60 days are the hold-out
    future = m.make_future_dataframe(periods=pred_len)
    forecast = m.predict(future)
    return (ggplot(forecast, aes("ds", "yhat"))
            + geom_ribbon(aes(x="ds", ymin="yhat_lower", ymax="yhat_upper"), fill="lightblue")
            + geom_line(colour="blue")
            + geom_line(pre_views, aes("ds", "y"), colour="black")
            + geom_line(post_views, aes("ds", "y"), colour="grey")
            + labs(title=str(art) + " - " + str(loc) + " - " + str(acc)))
In [117]:
plot_prophet_rownr_season(70772, False)
INFO:fbprophet.forecaster:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
Out[117]:
<ggplot: (-9223371846202401911)>

In [118]:
plot_prophet_rownr_season(108341, True)
INFO:fbprophet.forecaster:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
Out[118]:
<ggplot: (-9223371846202024658)>

In [119]:
plot_prophet_rownr_season(95856, True)
Out[119]:
<ggplot: (190652890302)>

In [120]:
plot_prophet_rownr_season(139120, True)
Out[120]:
<ggplot: (190652536717)>

6.2.2 The importance of seasonal variations

Enabling prophet to recognise long-term seasonal variations in the data is crucial for forecasting our time series successfully. To demonstrate this, I plot two sample curves below, each first without and then with the seasonal component: the German main page and the entry for Oxygen in the Spanish Wikipedia (many thanks to MuonNeutrino for flagging this time series in their great exploratory kernel):

In [121]:
plot_prophet_rownr_season(72480, False)
Out[121]:
<ggplot: (-9223371846202655523)>

In [122]:
plot_prophet_rownr_season(72480, True)
Out[122]:
<ggplot: (190652683398)>

In [123]:
plot_prophet_rownr_season(139120, False)
Out[123]:
<ggplot: (190652110733)>

In [124]:
plot_prophet_rownr_season(139120, True)
Out[124]:
<ggplot: (-9223371846202233340)>

For each page, the first plot (second argument False) shows the forecast without a seasonal component and the second plot (second argument True) shows the forecast with it. The seasonal forecasts track the actual evolution of the time series (grey line) much more closely than the non-seasonal ones. A successful prophet model for this project should therefore include a seasonal component.
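
The effect can also be reproduced without prophet. The block below is only a rough sketch on made-up data, not the model used in this notebook: it fits an ordinary least-squares line with and without one yearly sine/cosine pair (prophet models seasonality with Fourier terms of this kind) and compares the error on the last 60 days. The synthetic series, the helper names `design_matrix` and `holdout_mae`, and all numeric constants are assumptions chosen for illustration.

```python
import numpy as np


def design_matrix(t_years, seasonal):
    """Columns: intercept, linear trend, and optionally one yearly sine/cosine pair."""
    cols = [np.ones_like(t_years), t_years]
    if seasonal:
        cols += [np.sin(2 * np.pi * t_years), np.cos(2 * np.pi * t_years)]
    return np.column_stack(cols)


def holdout_mae(t_years, y, n_holdout, seasonal):
    """Fit on all but the last n_holdout points; return the MAE on the held-out points."""
    X = design_matrix(t_years, seasonal)
    coef, *_ = np.linalg.lstsq(X[:-n_holdout], y[:-n_holdout], rcond=None)
    pred = X[-n_holdout:] @ coef
    return float(np.mean(np.abs(pred - y[-n_holdout:])))


# Synthetic daily series (hypothetical, not taken from train_1.csv):
# two years of a linear trend plus a yearly cycle plus noise.
rng = np.random.default_rng(0)
t_years = np.arange(730) / 365.25
y = 100 + 10 * t_years + 30 * np.cos(2 * np.pi * t_years) + rng.normal(0, 3, size=730)

mae_trend_only = holdout_mae(t_years, y, n_holdout=60, seasonal=False)
mae_seasonal = holdout_mae(t_years, y, n_holdout=60, seasonal=True)
print(f"trend only: {mae_trend_only:.1f}, trend + yearly term: {mae_seasonal:.1f}")
```

On this toy series the trend-only fit misses the held-out stretch because it cannot follow the yearly cycle, which mirrors what the plots above show for the real pages.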

Thanks for reading this exploration! I’m grateful for all the upvotes and the great feedback.

Have fun!